Re: [zfs-discuss] Fileserver help.

2010-04-18 Thread Haudy Kazemi
Any comments on NexentaStor Community/Developer Edition vs EON for 
NAS/small server/home server usage?  It seems like Nexenta has been 
around longer or at least received more press attention.  Are there 
strong reasons to recommend one over the other?  (At one point usable 
space would have been a strong reason to use EON because NexentaStor 
Developer Edition had a small capacity limit of 1 TB, later 4 TB, but 
now it is a more usable 12 TB.)



R.G. Keen wrote:
Offhand, I'd say EON  


http://sites.google.com/site/eonstorage/

This is probably the best answer right now. It will be even better when they get a web administration GUI running. Some variant of FreeNAS on FreeBSD is also possible.


OpenSolaris is missing a good opportunity to expand its user base on two fronts.
1. It's *hard* to figure out which motherboards will just work and which will
have problems.
2. There's no better-packaged solution to this particular question. EON is
very, very close to solving this one.
  




Re: [zfs-discuss] Can RAIDZ disks be slices ?

2010-04-22 Thread Haudy Kazemi

Ian Collins wrote:

On 04/20/10 04:13 PM, Sunil wrote:

Hi,

I have a strange requirement. My pool consists of 2 500GB disks in 
stripe which I am trying to convert into a RAIDZ setup without data 
loss but I have only two additional disks: 750GB and 1TB. So, here is 
what I thought:


1. Carve a 500GB slice (A) in 750GB and 2 500GB slices (B,C) in 1TB.
2. Create a RAIDZ pool out of these 3 slices. Performance will be bad 
because of seeks in the same disk for B and C, but it's just temporary.
   


If the 1TB drive fails, you're buggered.  So there's not a lot of 
point setting up a raidz.
It is possible to survive the failure of a single drive that has multiple
slices in the same pool.  It requires using a RAIDZ level equal to or greater
than the number of slices on that drive.  RAIDZ2 with two slices on a 1 TB
drive survives that drive's loss just as RAIDZ1 does with one slice on it.


(I'm focusing on data survival here.  Performance will be worse than usual,
but even this impact may be mitigated by using a dedicated ZIL.  Remote and
cloud-based data storage using remote iSCSI devices and local ZIL devices has
been shown to have much better performance characteristics than would
otherwise be expected from a cloud-based system; see
http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and .)


With RAIDZ3, you can survive the loss of one drive with 3 slices on it that
are all in one pool (of course, at that point you can't handle any further
failures).  Reliability with this kind of configuration is at worst equal to
RAIDZ1, but likely better on average, because you can tolerate some specific
multiple-drive failure combinations that RAIDZ1 cannot handle.  A similar
comparison might be made between the reliability of a 4-drive RAIDZ2 pool and
4 drives in a stripe-mirror arrangement: you get similar usable space, but in
one case you can lose any 2 drives, while in the other you can lose any 1
drive and only some combinations of 2 drives.


I shared a variation of this idea a while ago in a comment here:
http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

A how-to is below:


You may as well create a pool on the 1TB drive and copy to that.


3. zfs send | recv my current pool data into the new pool.
4. Destroy the current pool.
5. In the new pool, replace B with the 500GB disk freed by the 
destruction of the current pool.
6. Optionally, replace C with second 500GB to free up the 750GB 
completely.


   

Or use the two 500GB and the 750 GB drive for the raidz.


Option to get all drives included:
1.) move all data to 1 TB drive
2.) create a RAIDZ1 or RAIDZ2 pool using the two 500 GB drives, the 750 GB
drive, and a sparse file that you delete right after the pool is created.
Your pool will be degraded by deleting the sparse file but will still work
(because it is a RAIDZ).  Use RAIDZ2 if you want ZFS's protections to be
active immediately (as you'll still have 3 out of 4 devices available).

3.) move all data from 1 TB drive to RAIDZ pool
4.) replace sparse file device with 1 TB drive (or 500 GB slice of 1 TB 
drive)

5.) resilver pool
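A minimal command sketch of the steps above (pool, dataset, file, and device
names are all hypothetical; adjust sizes and names to your hardware):

  # 1) temporary pool on the 1 TB drive; copy the data there, drop the old stripe
  zpool create temppool c2t1d0
  zfs snapshot oldpool/data@move
  zfs send oldpool/data@move | zfs recv temppool/data
  zpool destroy oldpool
  # 2) raidz1 from the two 500 GB drives, the 750 GB drive, and a sparse file
  mkfile -n 500g /var/tmp/placeholder
  zpool create newpool raidz c1t0d0 c1t1d0 c1t2d0 /var/tmp/placeholder
  zpool offline newpool /var/tmp/placeholder
  rm /var/tmp/placeholder          # pool runs degraded but keeps working
  # 3) copy the data onto the raidz and free the 1 TB drive
  zfs snapshot temppool/data@move2
  zfs send temppool/data@move2 | zfs recv newpool/data
  zpool destroy temppool
  # 4-5) replace the deleted sparse file with the freed 1 TB drive and resilver
  zpool replace newpool /var/tmp/placeholder c2t1d0
  zpool status newpool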

A variation on this is to create a RAIDZ2 using the two 500 GB drives, the
750 GB drive, and 2 sparse files.  After the data is moved from the 1 TB drive
to the RAIDZ2, two 500 GB slices are created on the 1 TB drive.  These 2
slices in turn are used to replace the 2 sparse files.  You'll end up with
3x500 GB of usable space and protection against at least 1 drive failure (if
it is the 1 TB drive) and up to 2 drive failures (among the other drives).
Performance caveats of 2 slices on one drive apply.


If you like, you can later add a fifth drive relatively easily by 
replacing one of the slices with a whole drive.




Re: [zfs-discuss] Can RAIDZ disks be slices ?

2010-04-23 Thread Haudy Kazemi

Sunil wrote:

If you like, you can later add a fifth drive
relatively easily by 
replacing one of the slices with a whole drive.





how does this affect my available storage if I were to replace both of those
sparse 500GB files with a real 1TB drive? Will it be the same? Or will I have
expanded my storage? If I understand correctly, I would need to replace the other 3
drives with 1TB as well to expand beyond 3x500GB.

So, in essence I can go from 3x500GB to 3x1000GB in-place with this scheme in
future if I have the money to upgrade all the drives to 1TB, WITHOUT needing
any movement of data to temp? Please say yes! :-)
  


It should work to replace devices the way you describe.  The only time 
you need some temp storage space is if you want to change the 
arrangement of devices that make up the pool, e.g. to go from 
striped-mirrors to RAIDZ2, or RAIDZ1 to RAIDZ2, or some other 
combination.  If you just want to replace devices with identical or 
larger sized devices you don't need to move the data anywhere.


The capacity will expand to whatever the smallest member device allows.  In
some OpenSolaris builds I believe this happened automatically once all member
devices had been upgraded.  In later builds I think it was changed to require
manual intervention, to prevent problems like the pool suddenly growing to
fill all the new, bigger drives when the admin really wanted the unused space
to stay unused, say for partition/slice-based short stroking, or when smaller
drives were being kept around as spares.  If ZFS had the ability to shrink
onto smaller devices, this would not have been as big a problem.
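For reference, a minimal sketch of the manual-growth path on recent builds
(pool and device names are hypothetical; the autoexpand property and
'zpool online -e' only exist on builds that include that change):

  # replace each small member with a larger device, one at a time
  zpool replace tank c1t0d0 c2t0d0
  # once every member is larger, allow the pool to grow
  zpool set autoexpand=on tank
  zpool online -e tank c2t0d0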


As I understand it from the documentation, replacement can happen two 
ways.  First, you can connect the replacement device to the system at 
the same time as the original device is working, and then issue the 
replace command.  I think this technique is safe, as the original device 
is still available during the replacement procedure and could be used to 
provide redundancy to the rest of the pool until the new device finishes 
resilvering.  (Does anyone know if this is really the case...i.e. if 
redundancy is preserved during the replacement operation when both 
original and new devices are connected simultaneously and both are 
functioning correctly?  One way to verify this might be to run zpool
replace on a non-redundant pool while both devices are connected.)


The second way is to (physically) disconnect the original device and
connect the new device in its place.  The pool will be degraded because
a member device is missing: if you have RAIDZ1, you have no redundancy
remaining; if you have RAIDZ2, you still have 1 level of redundancy
intact.  The zpool replace command should be able to rebuild the missing
data onto the replacement device.
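A minimal sketch of both approaches (pool and device names are hypothetical):

  # old and new devices both attached: resilver from the old device, which is then detached
  zpool replace tank c1t2d0 c1t5d0
  # old device already pulled, new one installed in its place under the same name:
  zpool replace tank c1t2d0
  # either way, follow the resilver
  zpool status tank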




Re: [zfs-discuss] ZFS Pool, what happen when disk failure

2010-04-23 Thread Haudy Kazemi

aneip wrote:

I am really new to zfs and also to raid.

I have 3 hard disk, 500GB, 1TB, 1.5TB.

On each HD I want to create a 150GB partition + the remaining space.

I want to create a raidz for the 3x150GB partitions. This is for my documents + photos.
  
You should be able to create 150 GB slices on each drive, and then 
create a RAIDZ1 out of those 3 slices.



As for the remaining space I want to create my video library. This one does not need any
redundancy since I can simply back up my DVDs again.

The question would be: if I create a striped pool from the remaining space (350 +
850 + 1350 GB), what happens if one of the HDs fails? Do I lose some files
or do I lose the whole pool?
  
Your remaining space can be configured as slices.  These slices can be 
added directly to a second pool without any redundancy.  If any drive 
fails, that whole non-redundant pool will be lost.  Data recovery 
attempts will likely find that any recoverable video is like 
swiss-cheese, with gaps in it.  This is because files are spread across 
striped devices as they're written to increase read and write 
performance.  In a JBOD (concatenation) arrangement, however, some files might still be
complete, but I don't believe ZFS supports JBOD-style non-redundant
pools.  For most people that is not a big deal: part of the point of
ZFS is data integrity and performance, and JBOD offers neither (a single
device failure still ruins it; it is just easier to carve files out of a
broken JBOD than out of a broken RAID).
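A minimal sketch of that layout (slice names are hypothetical, assuming the
150 GB slice is s0 and the leftover space is s1 on each drive):

  # redundant pool for documents and photos (three 150 GB slices)
  zpool create docs raidz c1t0d0s0 c1t1d0s0 c1t2d0s0
  # non-redundant striped pool for the video library (the leftover slices)
  zpool create video c1t0d0s1 c1t1d0s1 c1t2d0s1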




Re: [zfs-discuss] Thoughts on drives for ZIL/L2ARC?

2010-04-25 Thread Haudy Kazemi

Travis Tabbal wrote:

I have a few old drives here that I thought might help me a little, though not
as much as a nice SSD, for those uses. I'd like to speed up NFS writes, and
there have been some mentions that even a decent HDD can do this, though not to
the same level a good SSD will.

The 3 drives are older LVD SCSI Cheetah drives. ST318203LW. I have 2 controllers I could use, one appears to be a RAID controller with a memory module installed. An Adaptec AAA-131U2. The memory module comes up on Google as a 2MB EDO DIMM. Not sure that's worth anything to me. :) 

The other controller is an Adaptec 29160. Looks to be a 64-bit PCI card, but the machine it came from is only 32-bit PCI, as is my current machine. 


What say the pros here? I'm concerned that the max data rate is going to be 
somewhat low with them, but the seek time should be good as they are 10K RPM (I 
think). The only reason I thought to use one for L2ARC is for dedupe. It sounds 
like L2ARC helps a lot there. This is for a home server, so all I'm really 
looking to do is speed things up a bit while I save and look for a decent SSD 
option. However, if it's a waste of time, I'd rather find out before I install 
them.
  


I'd like to hear (or see tests of) how hard drive based ZIL/L2ARC can 
help RAIDZ performance.  Examples would be large RAIDZ arrays such as:

8+ drives in a single RAIDZ1
16+ drives in a single RAIDZ2
24+ drives in a single RAIDZ3
(None of these are a series of smaller RAIDZ arrays that are striped.)

From the writings I've seen, large non-striped RAIDZ arrays tend to 
have poor performance that is more or less limited to the I/O capacity 
of a single disk.  The recommendations tend to suggest using smaller 
RAIDZ arrays and then striping them together whereby the RAIDZ provides 
redundancy and the striping provides reasonable performance.  The 
advantage of large RAIDZ arrays is you can get better protection from 
drive failure (e.g. one 16 drive RAIDZ2 can lose any 2 drives vs two 8 
drive RAIDZ1 striped arrays that can lose only one drive per array).


So what about using a few dedicated two or three way mirrored drives for 
ZIL and/or L2ARC, in combination with the large RAIDZ arrays?  The 
mirrored ZIL/L2ARC would serve as a cache to the slower RAIDZ.
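A minimal sketch of adding such devices to an existing pool (pool and device
names are hypothetical; note that log devices can be mirrored, but cache
devices are always striped and cannot be):

  # mirrored ZIL (slog) in front of a large raidz
  zpool add tank log mirror c4t0d0 c4t1d0
  # L2ARC devices; these cannot be mirrored, so just add them
  zpool add tank cache c4t2d0 c4t3d0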


One model for this configuration is the cloud based ZFS test that was 
done here which used local drives configured as ZIL and L2ARC to 
minimize the impact of cloud latency, with respectable results:

http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and

The performance gap between local mirrored disks used for ZIL/L2ARC and 
a large RAIDZ is not nearly as large as the gap that was addressed in 
the cloud based ZFS test.  Is the gap large enough to potentially 
benefit from HDD based mirrored ZIL/L2ARCs?  Would SSD based ZIL/L2ARCs 
be necessary to see a worthwhile performance improvement?


If this theory works out in practice, useful RAIDZ array sizes may not be
limited as much as best-practices guidelines have held to date.  Admins may
then be able to choose larger, more strongly redundant RAIDZ arrays while
still keeping most of the performance of smaller striped RAIDZ arrays by
using mirrored ZIL and L2ARC disks or SSDs.


-hk



Re: [zfs-discuss] Exporting iSCSI - it's still getting all the ZFS protection, right?

2010-05-07 Thread Haudy Kazemi

Brandon High wrote:

"On Mon, May 3, 2010 at 4:33 PM, Michael Shadle  wrote:
  

Is ZFS doing its magic checksumming and whatnot on this share, even
though it is seeing junk data (NTFS on top of iSCSI...) or am I not
getting any benefits from this setup at all (besides thin
provisioning, things like that?)



The data on disk is protected, but it's not protected over the wire.

You still get snapshots, cloning, and all the other zfs features for
the dataset though.

-B
  


If someone wrote a "ZFS client", it'd be possible to get over-the-wire
data protection.  This would be continuous from the client computer all
the way to the storage device.  Right now there is data protection from 
the server to the storage device.  The best protected apps are those 
running on the same server that has mounted the ZFS pool containing the 
data they need (in which case they are protected by ZFS checksums and by 
ECC RAM, if present).


A "ZFS client" would run on the computer connecting to the ZFS server, 
in order to extend ZFS's protection and detection out across the network.


In one model, the ZFS client could be a proxy for communication between 
the client and the server running ZFS.  It would extend the filesystem 
checksumming across the network, verifying checksums locally as data was 
requested, and calculating checksums locally before data was sent that 
the server would re-check.  Recoverable checksum failures would be
transparent except for a performance loss; unrecoverable failures would be
reported using the standard OS error message for unrecoverable data
(Windows has one that it uses for bad sectors on drives and optical
media).  The local client checksum calculations would be useful in
detecting network failures and local hardware instability (i.e. if
most/all clients start seeing checksum failures, look at the network; if
only one client sees checksum failures, check that client's hardware).


An extension to the ZFS client model would allow multi-level ZFS systems 
to better coordinate their protection and recover from more scenarios.  
By multi-level ZFS, I mean ZFS stacked on ZFS, say via iSCSI.  An 
example (I'm sure there are better ones) would be 3 servers, each with 3 
data disks.  Each disk is made into its own non-redundant pool (making 9 
non-redundant pools).  These pools are in turn shared via iSCSI.  One of 
the servers creates RAIDZ1 groups using 1 disk from each of the 3 servers.
With a means for ZFS systems to communicate, a failure of any 
non-redundant lower level device need not trigger a system halt of that 
lower system, because it will know from the higher level system that the 
device can be repaired/replaced using the higher level redundancy.
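A rough sketch of the storage layout described above (all names are
hypothetical, and the iSCSI target/initiator plumbing is omitted):

  # on each of the three servers: one non-redundant pool per data disk,
  # each backing a zvol that is exported over iSCSI
  zpool create p1 c1t0d0
  zfs create -V 900g p1/lun
  # ...repeat for the other disks and servers...
  # on the aggregating server: raidz1 groups built from one iSCSI LUN
  # per physical server
  zpool create bigtank raidz c5t1d0 c5t2d0 c5t3d0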


A key to making this happen is an interface to request a block and its
related checksum (or, if speaking of CIFS, to request a file, its related
blocks, and their checksums).




Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving back

2010-05-14 Thread Haudy Kazemi
Now that you've re-imported, it seems like zpool clear may be the 
command you need, based on discussion in these links about missing and 
broken zfs logs:


http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg37554.html
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30469.html
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6707530
http://www.sun.com/msg/ZFS-8000-6X
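For example (pool name taken from your output below; a sketch, not a tested
recipe):

  pfexec zpool clear vault
  pfexec zpool status vault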


Jan Hellevik wrote:
Hey! It is there! :-) Cannot believe I did not try the import command 
again. :-)


But I still have problems - I had added a slice of an SSD as log and
another slice as cache to the pool. The SSD is there - c10d1 but ...
Ideas? The log part showed under the pool when I initially tried the 
import, but now it is gone. I am afraid of doing something stupid at 
this point in time. Any help is really appreciated!


j...@opensolaris:~$ pfexec zpool import
  pool: vault
id: 8738898173956136656
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        vault        UNAVAIL  missing device
          raidz1-0   ONLINE
            c11d0    ONLINE
            c12d0    ONLINE
            c12d1    ONLINE
            c10d1    ONLINE

Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.
j...@opensolaris:~$ pfexec format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c8d0 
  /p...@0,0/pci-...@14,1/i...@0/c...@0,0
   1. c10d0 
  /p...@0,0/pci-...@11/i...@0/c...@0,0
   2. c10d1 
  /p...@0,0/pci-...@11/i...@0/c...@1,0
   3. c11d0 
  /p...@0,0/pci-...@11/i...@1/c...@0,0
   4. c12d0 
  /p...@0,0/pci-...@14,1/i...@1/c...@0,0
   5. c12d1 
  /p...@0,0/pci-...@14,1/i...@1/c...@1,0
   6. c13t0d0 
  /p...@0,0/pci1022,9...@2/pci1000,3...@0/s...@0,0
   7. c13t1d0 
  /p...@0,0/pci1022,9...@2/pci1000,3...@0/s...@1,0
   8. c13t2d0 
  /p...@0,0/pci1022,9...@2/pci1000,3...@0/s...@2,0
   9. c13t3d0 
  /p...@0,0/pci1022,9...@2/pci1000,3...@0/s...@3,0
Specify disk (enter its number): ^C
j...@opensolaris:~$


On Thu, May 13, 2010 at 7:15 PM, Richard Elling
<richard.ell...@gmail.com> wrote:


now try "zpool import" to see what it thinks the drives are
 -- richard

On May 13, 2010, at 2:46 AM, Jan Hellevik wrote:

> Short version: I moved the disks of a pool to a new controller
without exporting it first. Then I moved them back to the original
controller, but I still cannot import the pool.
>
>
> j...@opensolaris:~$ zpool status
>
>  pool: vault
> state: UNAVAIL
> status: One or more devices could not be opened.  There are
insufficient
>replicas for the pool to continue functioning.
> action: Attach the missing device and online it using 'zpool
online'.
>   see: http://www.sun.com/msg/ZFS-8000-3C
> scrub: none requested
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        vault       UNAVAIL      0     0     0  insufficient replicas
>          raidz1-0  UNAVAIL      0     0     0  insufficient replicas
>            c12d1   UNAVAIL      0     0     0  cannot open
>            c12d0   UNAVAIL      0     0     0  cannot open
>            c10d1   UNAVAIL      0     0     0  cannot open
>            c11d0   UNAVAIL      0     0     0  cannot open
>        logs
>          c10d0p1   ONLINE       0     0     0
>
> j...@opensolaris:~$ zpool status
>  cannot see the pool
>
> j...@opensolaris:~$ pfexec zpool import vault
> cannot import 'vault': one or more devices is currently unavailable
>Destroy and re-create the pool from
>a backup source.
> j...@opensolaris:~$ pfexec poweroff
>
>  moved the disks back to the original controller
>
> j...@opensolaris:~$ pfexec zpool import vault
> cannot import 'vault': one or more devices is currently unavailable
>Destroy and re-create the pool from
>a backup source.
> j...@opensolaris:~$ pfexec format
> Searching for disks...done
>
>
> j...@opensolaris:~$ uname -a
> SunOS opensolaris 5.11 snv_133 i86pc i386 i86pc Solaris
>
> j...@opensolaris:~$ pfexec zpool history vault
> cannot open 'vault': no such pool
>
>
> ... and this is where I am now.
>
> The zpool contains my digital images and videos and I would be
really unhappy to lose them. What can I do to get back the pool?
Is there hope?
>
> Sorry for the long post - tried to assemble as much relevant
information as I could.
>

--
ZFS storage and performance consulting at http://www.RichardElling.com

--
Jan Hellevik
Tel: +47-41004070
---

Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving back

2010-05-14 Thread Haudy Kazemi
Is there any chance that the second controller wrote something onto the 
disks when it saw the disks attached to it, thus corrupting the ZFS 
drive signatures or more?


I've heard that some controllers require drives to be initialized by 
them and/or signatures written to drives by them.  Maybe your second 
controller wrote to the drives without you knowing about it.  If you 
have a pair of (small) spare drives, make a ZFS mirror out of them and 
try to recreate the problem by repeating your steps on them.


If you can recreate the problem, try to narrow it down to whether the
problem is caused by the second controller changing things, or whether the
skipped zpool export is playing a role.  I think the skipped zpool export
might have led to zpool import needing to be forced (-f), but as long as
you weren't trying to access the disks from two systems at the same time
it shouldn't have been catastrophic.  Forcing shouldn't be necessary if
things are being handled cleanly and correctly.


My hunch is the second controller did something when it saw the drives 
connected to it, particularly if the second controller was configured in 
RAID mode rather than JBOD or passthrough.  Or maybe you changed some 
settings on the second controller's BIOS that caused it to write to the 
drives while you were trying to get things to work?



I've seen something similar by the BIOS on a Gigabyte X38 chipset 
motherboard that has "Quad BIOS".  This is partly documented by Gigabyte at

http://www.gigabyte.com.tw/FileList/NewTech/2006_motherboard_newtech/how_does_quad_bios_work_dq6.htm

From my testing, the BIOS on this board writes a copy of itself using 
an HPA (Host Protected Area) to a hard drive for BIOS recovery purposes 
in case of a bad flashing/BIOS upgrade.  There is no prompting for the 
writing, it appears to simply happen to whichever drive was the first 
one connected to the PC, which is usually the current boot drive.  On a 
new clean disk, this would be harmless, but it risks data loss when 
reusing drives or transferring drives between systems.  This behavior is 
able to cause data loss and has affected people using Windows Dynamic 
Disks and UnRaid as can be seen by searching Google for "Gigabyte HPA".


More details:
As long as that drive is connected to the PC, the BIOS recognizes it as 
being the 'recovery' drive and doesn't write to another drive.  If that 
drive is removed, then another drive will have an HPA created on it.  
The easiest way to control this is to initially have just one drive 
connected...the one you don't mind the HPA being placed on.  Then you 
can add the other drives without them being modified.


The HPA is created on 2113 sectors at the end of the drive.  HDAT (a low 
level drive diag/repair/config utility) cannot remove this HPA while the 
drive is still the first drive (the BIOS must be enforcing protection of 
that area).  Making this drive a secondary drive by forcing the BIOS to 
create another HPA on another drive allows HDAT to remove the HPA.  
Manually examining the last 2114 (one more for good measure) sectors
will now show that they contain a BIOS backup image.


Other observations:
Device order in Linux (e.g. /dev/sda /dev/sdb) made no difference to 
where the HPA ended up.
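If you want to check a drive for an HPA from a Linux live CD, hdparm can
report it (device name is hypothetical; this is an alternative to HDAT):

  # a visible/native max-sector mismatch plus "HPA is enabled" indicates an HPA
  hdparm -N /dev/sdb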




Jan Hellevik wrote:

Yes, I turned the system off before I connected the disks to the other 
controller. And I turned the system off beore moving them back to the original 
controller.

Now it seems like the system does not see the pool at all. 


The disks are there, and they have not been used so I do not understand why I 
cannot see the pool anymore.

Short version of what I did (actual output is in the original post):
zpool status -> pool is there but unavailable
zpool import -> pool already created
zpool export -> I/O error
format
cfgadm
zpool status -> pool is gone..

It seems like the pool vanished after cfgadm?

Any pointers? I am really getting worried now that the pool is gone for good.

What I do not understand is why it is gone - the disks are still there, so it 
should be possible to import the pool?

What am I missing here? Any ideas?
  




Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving back

2010-05-15 Thread Haudy Kazemi
Can you recreate the problem with a second pool on a second set of 
drives, like I described in my earlier post?  Right now it seems like 
your problem is mostly due to the missing log device.  I'm wondering if 
that missing log device is what messed up the initial move to the other 
controller, or if the other controller did something to the disks when 
it saw them.



Jan Hellevik wrote:

I don't think that is the problem (but I am not sure). It seems like the problem
is that the ZIL is missing. It is there, but not recognized.

I used fdisk to create a 4GB partition of a SSD, and then added it to the pool 
with the command 'zpool add vault log /dev/dsk/c10d0p1'.

When I try to import the pool is says the log is missing. When I try to add the 
log to the pool it says there is no such pool (since it isn't imported yet). 
Catch22? :-)
  




Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving back

2010-05-15 Thread Haudy Kazemi

Jan Hellevik wrote:

Yes, I can try to do that. I do not have any more of this brand of disk, but I 
guess that does not matter. It will have to wait until tomorrow (I have an 
appointment in a few minutes, and it is getting late here in Norway), but I 
will try first thing tomorrow. I guess a pool on a single drive will do the 
trick? I can create the log as a partition on yet another drive just as I did 
with the SSD (do not want to mess with it just yet). Thanks for helping!
  


In this case the specific brand and model of drive probably does not matter.

The most accurate test will be to setup a test pool as similar as 
possible to the damaged pool, i.e. a 4 disk RAIDZ1 with a log on a 
partition of a 5th disk.


A single drive pool might do the trick for testing, but it has no 
redundancy.  The smallest pool with redundancy is a mirror, thus the 
suggestion to use a mirror.  If you have enough spare/small/old drives 
that are compatible with the second controller, use them to model your 
damaged pool.  For this test it doesn't really matter if these are 4 gb 
or 40 gb or 400 gb drives.



Try the following things in order.  Keep a copy of the terminal commands 
you use and the command responses you get.


1.) Wipe (e.g. dban/dd/zero wipe) disks that will make up the test pool, 
and create the test pool.  Copy some test data to the pool, like an 
OpenSolaris ISO file.
Try migrating the disks to the second controller the same way you did 
with your damaged pool.  Use the exact same steps in the same order.  
See your notes/earlier posts while doing this to make sure you remember 
them exactly.
If that works (a forced import will likely be needed), then you might
have had a one-time or hard-to-reproduce error, or maybe you did a
minor step slightly differently from how you remembered doing it with
the damaged pool.

If that fails, then you may have a repeatable test case.

2.) Wipe (e.g. dban/dd/zero wipe) disks that made up the test pool, and 
recreate the test pool. Copy some test data to the pool, like an 
OpenSolaris ISO file.
Try migrating the disks the recommended way, using export, powering 
everything off, and then import.
If that works (without needing a forced import), then skipping the export
was likely a trigger.
If that fails, it seems like the second controller is doing something to 
the disks.  Look at the controller BIOS settings for something relevant 
and see if there are any firmware updates available.


3.) If you have a third (different model) controller (or another
computer running the same Solaris version with a different controller),
repeat step 2 with it.  If step 2 failed but this works, that's more
evidence the second controller is up to something.
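A minimal sketch of step 2 with hypothetical device names (adjust the wipe
targets and device list to your spare disks; the dd wipes destroy all data
on them):

  dd if=/dev/zero of=/dev/rdsk/c20d0p0 bs=1024k     # repeat for each spare disk
  # shape the test pool like the damaged one: 4-disk raidz1 plus a log slice
  zpool create testpool raidz c20d0 c21d0 c22d0 c23d0 log c24d0p1
  cp /export/osol.iso /testpool/
  zpool export testpool
  # power off, move the disks to the second controller, power on
  zpool import testpool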




Re: [zfs-discuss] dedup status

2010-05-16 Thread Haudy Kazemi

Erik Trimble wrote:

Roy Sigurd Karlsbakk wrote:

Hi all

I've been doing a lot of testing with dedup and concluded it's not
really ready for production. If something fails, it can render the
pool unusable for hours or maybe days, perhaps due to single-threaded
stuff in zfs. There is also very little data available in the docs
(beyond what I've got from this list) on how much memory one
should have for deduping an xTiB dataset.
  
I think it was Richard a month or so ago that had a good post about
how much space a dedup table entry would be (it was in some
discussion where I asked about it).  I can't remember what it was (a
hundred bytes?) per DDT entry, but one had to remember that each entry
was for a slab, which can vary in size (512 bytes to 128k).  So,
there's no good generic formula for X bytes in RAM per Y TB of space.
You can compute a rough guess if you know what kind of data and the
general usage pattern is for the pool (basically, you need to take a
stab at how big you think the average slab size is).  Also, remember
that if you have a /very/ good dedup ratio, then you will have a
smaller DDT for a given X-size pool vs. a pool with poor dedup ratios.
Unfortunately, there's no magic bullet, though if you can dig up
Richard's post, you should be able to take a guess and not be off by
more than 2x or so.
Also, remember you only need to hold the DDT in L2ARC, not in actual 
RAM, so buy that SSD, young man!


As far as failures, well, I can't speak to that specifically.  Though,
do realize that not having sufficient L2ARC/RAM to hold the DDT does
mean that you spend an awful lot of time reading pool metadata,
which really hurts performance (not to mention it can cripple deletes of
any sort...)


Here's Richard Elling's post in the "dedup and memory/l2arc 
requirements" thread where he presents a worst case DDT size upper bound:

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039516.html

--start of copy--

You can estimate the amount of disk space needed for the deduplication table

and the expected deduplication ratio by using "zdb -S poolname" on your existing
pool.  Be patient, for an existing pool with lots of objects, this can take 
some time to run.

# ptime zdb -S zwimming
Simulated DDT histogram:

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.27M    239G    188G    194G    2.27M    239G    188G    194G
     2     327K   34.3G   27.8G   28.1G     698K   73.3G   59.2G   59.9G
     4    30.1K   2.91G   2.10G   2.11G     152K   14.9G   10.6G   10.6G
     8    7.73K    691M    529M    529M    74.5K   6.25G   4.79G   4.80G
    16      673   43.7M   25.8M   25.9M    13.1K    822M    492M    494M
    32      197   12.3M   7.02M   7.03M    7.66K    480M    269M    270M
    64       47   1.27M    626K    626K    3.86K    103M   51.2M   51.2M
   128       22    908K    250K    251K    3.71K    150M   40.3M   40.3M
   256        7    302K     48K   53.7K    2.27K   88.6M   17.3M   19.5M
   512        4    131K   7.50K   7.75K    2.74K    102M   5.62M   5.79M
    2K        1      2K      2K      2K    3.23K   6.47M   6.47M   6.47M
    8K        1    128K      5K      5K    13.9K   1.74G   69.5M   69.5M
 Total    2.63M    277G    218G    225G    3.22M    337G    263G    270G

dedup = 1.20, compress = 1.28, copies = 1.03, dedup * compress / copies = 1.50


real 8:02.391932786
user 1:24.231855093
sys       15.193256108

In this file system, 2.75 million blocks are allocated. The in-core size
of a DDT entry is approximately 250 bytes.  So the math is pretty simple:
in-core size = 2.63M * 250 = 657.5 MB

If your dedup ratio is 1.0, then this number will scale linearly with size.
If the dedup rate > 1.0, then this number will not scale linearly, it will be
less. So you can use the linear scale as a worst-case approximation.
-- richard

--end of copy--
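As a minimal worked example of that estimate for your own pool (pool name is
hypothetical; 250 bytes per in-core DDT entry is the approximation from the
post above):

  ptime zdb -S tank                    # note the Total 'blocks' column, e.g. 2.63M
  # 2.63M allocated blocks * 250 bytes per entry, reported in MB
  echo '2630000 * 250 / 1000000' | bc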




Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving back

2010-05-16 Thread Haudy Kazemi
I don't really have an explanation.  Perhaps flaky second controller 
hardware that only works sometimes and can corrupt pools?  Have you seen 
any other strangeness/instability on this computer? 

Did you use zpool export before moving the disks the first time to the 
second controller, or did you just move them without exporting?


If you dd zero wipe the disks that made up this test pool, and then 
recreate the test pool, does it behave the same way the second time?




Jan Hellevik wrote:

Ok - this is really strange. I did a test. Wiped my second pool (4 disks like 
the other pool), and used them to create a pool similar to the one I have 
problems with.

Then i powered off, moved the disks and powered on. Same error message as 
before. Moved the disks back to the original controller. Pool is ok. Moved the 
disks to the new controller.  At first it is exactly like my original problem, 
but when i did a second zpool import, the pool is imported ok.

Zpool status reports the same as before. I run the same command as I did the 
first time:
zpool status
zpool import
zpool export
format
cfgadm
zpool status
zpool import ---> now it imports the pool!

How can this be? The only difference (as far as I can tell) is that the cache/log is on a
2.5" Samsung disk instead of a 2.5" OCZ SSD.

Details follow (it is long - sorry):

Also note below - I did a zpool destroy mpool before poweroff - when I powered
on and did a zpool status it showed the pool as UNAVAIL. It should not be there
at all, if I understand correctly?

- create the partitions for log and cache

 Total disk size is 30401 cylinders
 Cylinder size is 16065 (512 byte) blocks

                                                Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 Solaris2          1    608      608    2
          2                 Solaris2        609   3040     2432    8

format> quit
j...@opensolaris:~# zpool destroy mpool
j...@opensolaris:~# poweroff

Last login: Sun May 16 17:07:15 2010 from macpro.janhelle
Sun Microsystems Inc.   SunOS 5.11  snv_134 February 2010
j...@opensolaris:~$ pfexec bash
j...@opensolaris:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
   0. c8d0 
  /p...@0,0/pci-...@14,1/i...@0/c...@0,0
   1. c10d0 
  /p...@0,0/pci-...@11/i...@0/c...@0,0
   2. c10d1 
  /p...@0,0/pci-...@11/i...@0/c...@1,0
   3. c11d0 
  /p...@0,0/pci-...@11/i...@1/c...@0,0
   4. c12d0 
  /p...@0,0/pci-...@14,1/i...@1/c...@0,0
   5. c12d1 
  /p...@0,0/pci-...@14,1/i...@1/c...@1,0
Specify disk (enter its number): ^C
j...@opensolaris:~# zpool create vault2 raidz c10d1 c11d0 c12d0 c12d1
j...@opensolaris:~# zpool status

-- this pool is the one I destroyed - why is it here now?

  pool: mpool
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        mpool        UNAVAIL      0     0     0  insufficient replicas
          mirror-0   UNAVAIL      0     0     0  insufficient replicas
            c13t2d0  UNAVAIL      0     0     0  cannot open
            c13t0d0  UNAVAIL      0     0     0  cannot open
          mirror-1   UNAVAIL      0     0     0  insufficient replicas
            c13t3d0  UNAVAIL      0     0     0  cannot open
            c13t1d0  UNAVAIL      0     0     0  cannot open

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c8d0s0    ONLINE       0     0     0

errors: No known data errors

  pool: vault2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        vault2      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c10d1   ONLINE       0     0     0
            c11d0   ONLINE       0     0     0
            c12d0   ONLINE       0     0     0
            c12d1   ONLINE       0     0     0

errors: No known data errors
j...@opensolaris:~# zpool destroy mpool
cannot open 'mpool': I/O error
j...@opensolaris:~# zpool status -x
all pools are healthy
j...@opensolaris:~# 
j...@opensolaris:~# 
j...@opensolaris:~# zpool status 



-- and now the pool is vanished

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c8d0s0    ONLINE       0     0     0

errors: No known data errors

  pool: vault2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        vault2  

Re: [zfs-discuss] New SSD options

2010-05-22 Thread Haudy Kazemi

Bob Friesenhahn wrote:

On Fri, 21 May 2010, Don wrote:

You could literally split a sata cable and add in some capacitors for 
just the cost of the caps themselves. The issue there is whether the 
caps would present too large a current drain on initial charge up- If 
they do then you need to add in charge controllers and you've got the 
same problems as with a LiPo battery- although without the shorter 
service life.


Electricity does run both directions down a wire and the capacitor 
would look like a short circuit to the supply when it is first turned 
on.  You would need some circuitry which delays applying power to the 
drive before the capacitor is sufficiently charged, and some circuitry 
which shuts off the flow of energy back into the power supply when the 
power supply shuts off (could be a silicon diode if you don't mind the 
0.7 V drop).


Bob


You can also use an appropriately wired field effect transistor (FET) / 
MOSFET of sufficient current carrying capacity as a one-way valve 
(diode) that has minimal voltage drop.

More:
http://electronicdesign.com/article/power/fet-supplies-low-voltage-reverse-polarity-protecti.aspx
http://www.electro-tech-online.com/general-electronics-chat/32118-using-mosfet-diode-replacement.html


In regard to how long you need to continue supplying power: that
comes down to how long the SSD waits before flushing its cache to
flash.  If you can identify the maximum write cache flush interval, and
size the battery or capacitor to exceed that maximum interval, you
should be okay.  The maximum write cache flush interval is determined by
a timer that says something like "okay, we've waited 5 seconds for
additional data to arrive to be written.  None has arrived in the last 5
seconds, so we're going to write what we already have to better ensure
data integrity, even though it is suboptimal from an absolute performance
perspective."  In the conventional terms of filling city buses: the bus
leaves when it is full of people, or when 15 minutes have passed since the
last bus left.


Does anyone know if there is a way to directly or indirectly measure the 
write caching flush interval?  I know cache sizes can be found via 
performance testing, but what about write intervals?



Re: [zfs-discuss] Understanding ZFS performance.

2010-05-22 Thread Haudy Kazemi

Brian wrote:

Sometimes when it hangs on boot hitting space bar or any key won't bring it 
back to the command line.  That is why I was wondering if there was a way to 
not show the splashscreen at all, and rather show what it was trying to load 
when it hangs.
  

Look at these threads:

OpenSolaris b134 Genunix site iso failing boot
http://opensolaris.org/jive/thread.jspa?threadID=125445&tstart=0

Build 134 Won't boot
http://ko.opensolaris.org/jive/thread.jspa?threadID=125486
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6932552

How to bypass boot splash screen?
http://opensolaris.org/jive/thread.jspa?messageID=355648


They talk about changing some GRUB menu.lst options by either adding
'console=text' or removing 'console=graphics'.  See if that works for
you too.
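For reference, a hedged sketch of that kind of menu.lst edit (the exact
kernel$ line varies by build, so treat this as illustrative only):

  # /rpool/boot/grub/menu.lst -- change the console argument on the kernel$ line
  # before: kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=graphics
  # after:  kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=text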



Re: [zfs-discuss] USB Flashdrive as SLOG?

2010-05-25 Thread Haudy Kazemi

Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Kyle McDonald

I've been thinking lately that I'm not sure I like the root pool being
unprotected, but I can't afford to give up another drive bay. 



I'm guessing you won't be able to use the USB thumbs as a boot device.  But
that's just a guess.

However, I see nothing wrong with mirroring your primary boot device to the
USB.  At least in this case, if the OS drive fails, your system doesn't
crash.  You're able to swap the OS drive and restore your OS mirror.


  

That led me to wonder whether partitioning out 8 or 12 GB on a 32GB
thumb drive would be beneficial as an slog?? 



I think the only way to find out is to measure it.  I do have an educated
guess though.  I don't think even the fastest USB flash drives are able to
work quickly, with significantly low latency.  That's based on measurements I made
years ago, so again I emphasize, the only way to find out is to test it.

One thing you could check, which does get you a lot of mileage for "free"
is:  Make sure your HBA has a BBU, and enable the WriteBack.  In my
measurements, this gains about 75% of the benefit that log devices would
give you.
  


There are, or at least have been, some issues with ZFS and USB devices.
Here's one that is still open:

Bug 4755 - ZFS boot does not work with removable media (usb flash memory)
http://defect.opensolaris.org/bz/show_bug.cgi?id=4755

Regarding performance: USB flash drives vary significantly in
performance between brands and models.  Some get close
to USB 2.0 theoretical limits, others just barely exceed USB 1.1.  Vista
and Windows 7 support the use of USB flash drives for ReadyBoost, a
caching system to reduce application load times.  Windows tests have
shown that, with enough RAM, ReadyBoost caching offers little
additional performance (as Windows makes use of system RAM for file
caching too).


I think using good USB flash drives has the potential to improve
performance, and if you can keep mirrored flash drives on different,
dedicated USB controllers that will help performance the most.  If USB
support in OpenSolaris is poor and has weak performance, I wonder if
an iSCSI target created out of the USB device on a Linux or Windows
system on the same network might be able to offer better performance.
Even if latency goes to 2-3 ms, that's still much better than the 8.5 ms
random seek times on a 7200 rpm hard disk.






Re: [zfs-discuss] nfs share of nested zfs directories?

2010-05-27 Thread Haudy Kazemi

Brandon High wrote:

On Thu, May 27, 2010 at 1:02 PM, Cassandra Pugh  wrote:
  

   I was wondering if there is a special option to share out a set of nested
   directories?  Currently if I share out a directory with
/pool/mydir1/mydir2
   on a system, mydir1 shows up, and I can see mydir2, but nothing in
mydir2.
   mydir1 and mydir2 are each a zfs filesystem, each shared with the proper
   sharenfs permissions.
   Did I miss a browse or traverse option somewhere?



What kind of client are you mounting on? Linux clients don't properly
follow nested exports.

-B
  


This behavior is not limited to Linux clients nor to nfs shares.  I've 
seen it with Windows (SMB) clients and CIFS shares.  The CIFS version is 
referenced here:


Nested ZFS Filesystems in a CIFS Share
http://mail.opensolaris.org/pipermail/cifs-discuss/2008-June/000358.html
http://bugs.opensolaris.org/view_bug.do?bug_id=6582165

Is there any commonality besides the observed behaviors?
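A minimal reproduction sketch (pool and path names are hypothetical):

  zfs create -o sharenfs=on tank/mydir1
  zfs create -o sharenfs=on tank/mydir1/mydir2
  touch /tank/mydir1/mydir2/hello
  # on a Linux client, the nested filesystem shows up as an empty directory
  mount -t nfs server:/tank/mydir1 /mnt
  ls /mnt/mydir2                          # empty unless mounted separately
  mount -t nfs server:/tank/mydir1/mydir2 /mnt/mydir2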



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-18 Thread Haudy Kazemi

Brandon High wrote:
On Fri, Jul 9, 2010 at 5:18 PM, Brandon High wrote:


I think that DDT entries are a little bigger than what you're
using. The size seems to range between 150 and 250 bytes depending
on how it's calculated, call it 200b each. Your 128G dataset would
require closer to 200M (+/- 25%) for the DDT if your data was
completely unique. 1TB of unique data would require 600M - 1000M
for the DDT.


Using 376b per entry, it's 376M for 128G of unique data, or just under 
3GB for 1TB of unique data.


A 1TB zvol with 8k blocks would require almost 24GB of memory to hold 
the DDT. Ouch.


-B



To reduce RAM requirements, consider an offline or idle-time dedupe.  I
suggested a variation of this in regard to compression a while ago,
probably on this list.


In either case, you have the system write the data whichever way is fastest.

If there is enough unused CPU power, run maximum compression; otherwise
use fast compression.  If new data-type-specific compression algorithms
are added, attempt compression with those as well (e.g. lossless JPEG
recompression that can save 20-25% space).  Store the block in whichever
compression format works best.


If there is enough RAM to maintain a live dedupe table, dedupe right away.

If CPU and RAM pressures are too high, defer dedupe and compression to a 
periodic scrub (or some other new periodically run command).  In the 
deferred case, the dedupe table entries could be generated as blocks are 
filled/change and then kept on disk.  Periodically that table would be 
quicksorted by the hash, and then any duplicates would be found next to 
each other.  The blocks for the duplicates would be looked up, verified 
as truly identical, and then re-written (probably also using BP 
rewrite).  Quicksort is parallelable and sorting a multi-gigabyte table 
is a plausible operation, even on disk.  Quicksort 100mb pieces of it in 
RAM and iterate until the whole table ends up sorted.


The end result of all this idle time compression and deduping is that 
the initially allocated storage space becomes the upper bound storage 
requirement, and that the data will end up packing tighter over time.  
The phrasing on bulk packaged items comes to mind: "Contents may have 
settled during shipping".



Now a theoretical question about dedupe...what about the interaction 
with defragmentation (this also probably needs BP rewrite)?  The first 
file will be completely defragmented, but the second file that is a 
slight variation of the first will have at least two fragments (the 
deduped portion, and the unique portion).  Probably the performance 
impact will be minor as long as each fragment is a decent minimum size 
(multiple MB).




Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Haudy Kazemi

A few things:

1.) did you move your drives around or change which controller each one 
was connected to sometime after installing and setting up OpenSolaris?  
If so, a pool export and re-import may be in order.


2.) are you sure the drive is failing?  Does the problem only affect 
this drive or are other drives randomly affected too?  If you've run 
'zpool clear' and the problem comes back, something is wrong but it 
could also be RAM, CPU, motherboard, controller, or power supply 
problems.  Smartmontools can read the drive SMART data and device error 
logs...run it from an Ubuntu 10.04 Live CD (sudo apt-get install 
smartmontools) or from a PartedMagic Live CD if you have trouble getting 
Smartmontools working on OpenSolaris with your hardware.


3.) on some systems I've found another version of the iostat command to 
be more useful, particularly when iostat -En leaves the serial number 
field empty or otherwise doesn't read the serial number correctly.  Try 
this:


iostat -Eni

This should give you a list of drives showing their name in the cXtYdZsN 
format, and their Device ID which may contain the drive serial numbers 
concatenated with the model.  Compare that list with your 'zpool status 
tank' output, which in your case means looking for 'c2t3d0'.  Once you 
find the serial number, you can look at labels printed on your drives 
and verify which one it is.
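A minimal sketch of that comparison (pool and device names are from your
output; iostat also accepts a single device as an operand):

  zpool status tank          # find the suspect device name, e.g. c2t3d0
  iostat -Eni                # list every drive with its Device ID
  iostat -Eni c2t3d0         # or query just the suspect drive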


One tip: if your server case is hard to work in or it is otherwise 
difficult to remove drives to read the serial numbers (lots of screws, 
cables in the way, tight fits, etc.), create additional serial number 
labels for the drives and stick them on the drive in a place you can 
read them without removing the drive from the drive bay.  This will make 
it easier to find a particular drive next time you need to replace or 
upgrade one.  This problem is most significant on hardware/OS combinations 
that don't provide a way to signal where a particular drive is 
physically installed.  (This includes a lot of whitebox and small server 
hardware and OSes.)




iostat relevant man page entries:

http://docs.sun.com/app/docs/doc/816-5166/iostat-1m?l=en&n=1&a=view 



-E
Display all device error statistics.

-i
In -E output, display the Device ID instead of the Serial No. The Device 
Id is a unique identifier registered by a driver through 
ddi_devid_register(9F).


-n
Display names in descriptive format. For example, cXtYdZ, rmt/N, 
server:/export/path.


By default, disks are identified by instance names such as ssd23 or 
md301. Combining the -n option with the -x option causes disk names to 
display in the cXtYdZsN format which is more easily associated with 
physical hardware characteristics. The cXtYdZsN format is particularly 
useful in FibreChannel (FC) environments where the FC World Wide Name 
appears in the t field.







Cindy Swearingen wrote:

Hi--

A google search of ST3500320AS turns up Seagate Barracuda drives.

All 7 drives in the pool tank are ST3500320AS. The other two c1t0d0
and c3d0 are unknown, but are not part of this pool.

You can also use fmdump -eV to see how long c2t3d0 has had problems.

Thanks,

Cindy

On 07/19/10 09:29, Yuri Homchuk wrote:

Thanks Cindy,

But format shows exactly same thing:
All of them appear as Seagate, no WD at all...
How could it be ???

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c1t0d0 
  /p...@0,0/pci15d9,a...@5/d...@0,0
   1. c1t1d0 
  /p...@0,0/pci15d9,a...@5/d...@1,0
   2. c2t0d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@0,0
   3. c2t2d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@2,0
   4. c2t3d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@3,0
   5. c2t4d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@4,0
   6. c2t5d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@5,0
   7. c2t7d0 
  /p...@0,0/pci10de,3...@a/pci15d9,a...@0/s...@7,0
   8. c3d0 
  /p...@1,0/pci1022,7...@2/pci-...@1/i...@1/c...@0,0
Specify disk (enter its number): ^C


Thanks again.


-Original Message-
From: Cindy Swearingen [mailto:cindy.swearin...@oracle.com] Sent: 
Monday, July 19, 2010 9:16 AM

To: Yuri Homchuk
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Help identify failed drive

Hi--

I don't know what's up with iostat -En but I think I remember a 
problem where iostat does not correctly report drives running in 
legacy IDE mode.


You might use the format utility to identify these devices.

Thanks,

Cindy
On 07/18/10 14:15, Alxen4 wrote:

This is a situation:

I've got an error on one of the drives in 'zpool status' output:

 zpool status tank

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are 
unaffected.
action: Determine if the device needs to be replaced, and clear 

Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Haudy Kazemi



3.) on some systems I've found another version of the iostat command to be more 
useful, particularly when iostat -En leaves the serial number field empty or 
otherwise doesn't read the serial number correctly.  Try
this:
  


' iostat -Eni ' indeed outputs a Device ID on some of the drives, but I still
can't understand how it helps me to identify the model of a specific drive.
  

See below.  Some


# iostat -Eni
c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: GIGABYTE i-RAM  Revision:  Device Id: 
id1,c...@agigabyte_i-ram=33f100336d9cc244b01d
Size: 2.15GB <2146443264 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
  

GIGABYTE i-RAM (2GB RAM based SSD)
Probably serial number: 33F100336D9CC244B01D


c1t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@ast3500320as=9qm34ybz
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 56 Predictive Failure Analysis: 0
  
c1t0d0 = controller 1, target 0, device 0.  Match the device names in your
zpool status output with this.

ST3500320AS=9QM34YBZ
Model: ST3500320AS   (ST means Seagate Technologies)
Serial: 9QM34YBZ


c1t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@ast3500320as=9qm353d2
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 66 Predictive Failure Analysis: 0
  

c1t1d0 = controller 1 target 1 device 0
ST3500320AS=9QM353D2


c0t0d0   Soft Errors: 0 Hard Errors: 198 Transport Errors: 0
Vendor: SONY Product: CD-ROM CDU5212   Revision: 5YS1 Device Id:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 198 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

Sony CDU5212 52X CD optical drive


c2t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@n5000c5000b9f49ef
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

c2t0d0 = controller 2 target 0 device 0

You should seriously consider checking, and probably updating, the 
firmware on all your Seagate ST3500320AS 7200.11 drives.  Version SD15 
(as reported above) is a known bad firmware.

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951


c2t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id:
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

controller 2 again

c2t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@n5000c5000b9be7ab
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

controller 2 again

c2t3d0   Soft Errors: 0 Hard Errors: 9 Transport Errors: 9
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@n5000c5000b9fa1ab
Size: 500.11GB <500107862016 bytes>
Media Error: 7 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  
controller 2 again.  This one is the suspect (c2t3d0), but we don't know
its printed serial number.  The controller is reporting it as
'n5000c5000b9fa1ab', which is the drive's World Wide Name rather than
the serial number printed on its label.

c2t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@n5000c5000b9fde5f
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

controller 2 again

c2t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 
id1,s...@n5000c5000b9fae42
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

controller 2 again

c2t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id:
Size: 500.11GB <500107862016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
  

controller 2 again.  No Device Id reported.  (strange)(not really present?)


c2t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3500320AS  Revision: SD15 Device Id: 

Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Haudy Kazemi



This is Supermicro Server.

I really don't remember the controller model; I set it up about 3 years 
ago. I just remember that I needed to reflash the controller firmware to 
make it work in JBOD mode.


Remember, changing controller firmware may affect your ability to access 
drives.  Backup first, as your array is still working although 
degraded.  Then update the firmware of the controller(s), and the 
firmware of your Seagate 7200.11 drives.



Note that the preferred modes are in order of choice:
1.) plain AHCI ports connected to the PCI-E bus (includes most built-in 
ports on recent motherboards; older boards may have them on the PCI bus)

2.) RAID ports configured as single drive arrays
3.) JBOD ports configured as single drive JBODs

1 is best
2 is preferred over 3 because some controllers have lower performance in 
JBOD mode or hide features.
IDE (PATA) ports with a single master drive on them rank at approximately 
1.5 on this scale.  Putting a second drive on a PATA port is like using 
a SATA port multiplier: your bandwidth gets reduced and performance can 
suffer.
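
If you are unsure which mode your ports are actually running in, one hedged 
check is to look at the driver bindings (output layout varies by release):

prtconf -D | grep -i ahci          # AHCI ports show up bound to the 'ahci' driver
prtconf -D | egrep -i 'sata|mpt'   # other common SATA/SAS HBA drivers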



 


I ran the script you suggested:

But it looks like it's still unable to map sd11 and sd12 to an actual 
c*t*d*...


How many different controllers do you have?  You'll need to look all 
this up to sort out the mess.  Your logs show you have at least 3 
different controllers (c1, c2, and c3) and maybe more for the sd11 and 
sd12 devices.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Haudy Kazemi

Marty Scholes wrote:

' iostat -Eni ' indeed outputs Device ID on some of the drives, but I still
can't understand how it helps me to identify the model of a specific drive.
  


Get and install smartmontools.  Period.  I resisted it for a few weeks but it 
has been an amazing tool.  It will tell you more than you ever wanted to know 
about any disk drive in the /dev/rdsk/ tree, down to the serial number.

I have seen zfs remember original names in a pool after they have been renamed by the OS 
such that "zpool status" can list c22t4d0 as a drive in the pool when there 
exists no such drive on the system.
  
Run smartmontools on a Linux LiveCD if necessary.  For a while (at least 
when OpenSolaris 2009.06 was released) smartmontools could not get drive 
information on drives on certain controllers.
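
For example, something along these lines will usually dump the model and 
serial number for each disk (the device glob below is illustrative, and some 
SATA controllers need the '-d sat' option or won't respond at all):

for d in /dev/rdsk/c*t*d0s0; do
  echo "== $d"
  smartctl -i -d sat $d | egrep 'Model|Serial'
done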



Why has it been reported as bad (for probably 2 months now, I haven't
got around to figuring out which disk in the case it is etc.) but
iostat isn't showing me any errors?



Start a scrub or do an obscure find, e.g. "find /tank_mountpoint -name core" 
and watch the drive activity lights.  The drive in the pool which isn't 
blinking like crazy is a faulted/offlined drive.

Ugly and oh-so-hackerish, but it works.
  
You might also be able to figure it out from drive vibration or a lack 
thereof.  Many people rolling their own server hardware don't have 
per-drive activity lights, hence the recommendation to figure out how to 
identify drives in software via their serial numbers and then match up 
with the labels.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Haudy Kazemi

Yuri Homchuk wrote:


 

 

Well, this really is a production server with 300 users and 12 VMs 
running on it, so I definitely won't play with firmware :)


I can easily identify which drive is what by physically looking at it.

It's just sad to realize that I cannot trust solaris anymore.

I never noticed this problem before because we were always using 
 Seagate drives, so I didn't notice any difference


 


In my understanding there are three controllers:

 


C1 -- built-in AHCI controller

C2 -- built-in controller that I needed to reflash

C3 -- PCI card, old SATA 1.5 controller - not in use, just ignore it.


Which drives are physically connected to which controller?


 


I guess C2 is the one that gives me hassles.

 


Is there a way to retrieve the model from Solaris?


dmidecode might do it.  I don't know exactly what syntax it needs though.
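
A couple of other hedged options that sometimes turn up the controller's 
vendor/model strings on Solaris-family systems (output varies by platform):

prtconf -pv | more     # raw device tree properties, including PCI IDs
prtdiag -v | more      # platform/slot inventory, where supported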
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Ubuntu

2010-07-19 Thread Haudy Kazemi

Rodrigo E. De Leรณn Plicet wrote:

On Fri, Jun 25, 2010 at 9:08 PM, Erik Trimble  wrote:
  

(2) Ubuntu is a desktop distribution. Don't be fooled by their "server"
version. It's not - it has too many idiosyncrasies and bad design choices to
be a stable server OS.  Use something like Debian, SLES, or RHEL/CentOS.



Why would you say that?

What "idiosyncrasies and bad design choices" are you talking about?

Just curious


I asked Erik about this.  Here is the consolidated discussion:


E>> (2) Ubuntu is a desktop distribution. Don't be fooled by their
E>> "server" version. It's not - it has too many idiosyncrasies and bad
E>> design choices to be a stable server OS.  Use something like Debian,
E>> SLES, or RHEL/CentOS.

H> Can you explain some of or link me to the bad design choices or
H> idiosyncrasies that make Ubuntu not stable as a server OS? 
H>
H> I'm only referring to the Ubuntu Server LTS releases (5 year long term
H> support) which to date have included 6.06, 8.04, and 10.04.  The LTS
H> releases are held to a higher standard (more showcase) than their other
H> releases (testing).  Ubuntu uses non-LTS releases to encourage rapid
H> project development (e.g. adding GRUB2 in 9.10).  I won't argue about
H> the non-LTS releases.

E Specifically, even on the LTS releases, the "server" version actually
E uses the same packages as the "desktop" version, which leads to problems
E with assumed defaults. Classic case is iptables, where many of the
E management packages that go with it use desktop-oriented defaults rather
E than server-oriented defaults.  So, when you upgrade (or, even just
E update), it tends to break things.

H>>> I've seen stuff break on upgrades as well, although updates have been fine
H>>> for me.  This breakage is why I delay upgrades on shared machines that
H>>> impact others (e.g. on HTPCs with MythTV on Ubuntu) until I know I can do
H>>> a clean slate rebuild if the upgrade doesn't go right, and often do the
H>>> clean slate rebuild anyway.  I treat an Ubuntu upgrade similar to a
H>>> Windows upgrade (e.g. WinXP to Win7).

E I've also seen issues with obsoleting/removing packages (or, more
E likely, specific items from packages) without notification. I lost the
E libstdc++5 library after 8.04 (it's not even in the 10.04 LTS), and
E there was no mention of it being deleted, not even buried in release
E notes.

H>>> Features lost to claims of 'UI design improvement and simplification' are
H>>> right up there on the annoyance list.  Removed features often end up in
H>>> big discussion threads on http://brainstorm.ubuntu.com/ .  This moving
H>>> target characteristic certainly makes it harder to get programs working
H>>> that aren't already in the repositories.  At least an LTS release has a
H>>> safe 3 year usage window (as all LTS packages are maintained for 3 years,
H>>> and server packages are maintained for 5 years).

E Ubuntu doesn't seem to really care about long-term stability.  I'm not
E talking about the kernel ABI (which is really out of their hands). I'm
E talking about being very careful about not breaking userland and
E admin-land stuff without advanced notice, and significant failure to
E support a transition period.  Stuff just goes away and/or breaks at a
E whim between releases.
 
H>>> Ubuntu's long term stability (i.e. software compatibility) appears

H>>> intended to stay within a single LTS version, which I feel is really
H>>> only 'for sure' for 3 years.  That is on the short end of the range of
H>>> even popular general desktop OSes (WinXP is an outlier).
H>>>
H>>> In spite of its shortcomings, the primary advantages Ubuntu maintains are
H>>> 1.) wider, earlier hardware support
H>>> 2.) rapid iteration: a regular release cycle that gets software into
H>>> testing and use, in preparation for the next release cycle for that
H>>> software and Ubuntu as a whole
H>>>
H>>> I think Ubuntu LTS still makes sense for machines (even servers) that
H>>> don't need long-term stability (i.e. software compatibility) and can
H>>> benefit from the earlier hardware support it offers.  It's not an OS to
H>>> install and leave alone (only patching) for extended time periods.

E>> All of which is fine if you're running a home server, or maybe designing
E>> a black-box device.  None of which is acceptable for general-purpose,
E>> server-room machines. 


H> It is fine for virtual servers intended to run small/ephemeral
H> websites that you don't mind migrating again in the near term.  Not so
H> good for an important piece of backbone infrastructure that simply
H> needs to run without periodic tuning.

E>> Product cycles there are 8-10 years, and there
E>> has to be significant upgradability (i.e. I should be able to expect to
E>> upgrade my OS and not break anything for a span of about 20 years,
E>> covering probably 3 major releases).

Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143

2010-07-20 Thread Haudy Kazemi



Could it somehow not be compiling 64-bit support?


--
Brent Jones



I thought about that but it says when it boots up that it is 64-bit, and I'm 
able to run
64-bit binaries.  I wonder if it's compiling for the wrong processor 
optimization though?
Maybe if it is missing some of the newer SSEx instructions the zpool checksum 
checking is
slowed down significantly?  I don't know how to check for this though and it 
seems strange
it would slow it down this significantly.  I'd expect even a non-SSE enabled binary to 
be able to calculate a few hundred MB of checksums per second for a 2.5+ghz processor.


Chad


Would it be possible to do a closer comparison between Rich Lowe's fast 
142 build and your slow 142 build?  For example run a diff on the 
source, build options, and build scripts.  If the build settings are 
close enough, a comparison of the generated binaries might be a faster 
way to narrow things down (if the optimizations are different then a 
resultant binary comparison probably won't be useful).
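
A rough sketch of checks that might help narrow it down (the build tree 
paths below are placeholders, not real locations):

isainfo -kv     # confirm a 64-bit kernel is actually running
isainfo -x      # list the SSE/instruction-set extensions the OS sees
# Diff the two source trees and the nightly environment files:
diff -r /build/richlowe-142/usr/src /build/mine-142/usr/src > src.diff
diff /build/richlowe-142/opensolaris.sh /build/mine-142/opensolaris.sh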


You said previously that:

The procedure I followed was basically what is outlined here:
http://insanum.com/blog/2010/06/08/how-to-build-opensolaris

using the SunStudio 12 compilers for ON and 12u1 for lint.
  
Are these the same compiler versions Rich Lowe used?  Maybe there is a 
compiler optimization bug.  Rich Lowe's build readme doesn't tell us 
which compiler he used.

http://genunix.org/dist/richlowe/README.txt


I suppose the easiest way for me to confirm if there is a regression or if my
compiling is flawed is to just try compiling snv_142 using the same procedure
and see if it works as well as Rich Lowe's copy or if it's slow like my other
compilations.

Chad


Another older compilation guide:
http://hub.opensolaris.org/bin/view/Community+Group+tools/building_opensolaris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 1tb SATA drives

2010-07-24 Thread Haudy Kazemi



But if it were just the difference between 5min freeze when a drive
fails, and 1min freeze when a drive fails, I don't see that anyone
would care---both are bad enough to invoke upper-layer application
timeouts of iSCSI connections and load balancers, but not disastrous.

but it's not.  ZFS doesn't immediately offline the drive after 1 read
error.  Some people find it doesn't offline the drive at all, until
they notice which drive is taking multiple seconds to complete
commands and offline it manually.  so you have 1 - 5 minute freezes
several times a day, every time the slowly-failing drive hits a latent
sector error.

I'm saying the works:notworks comparison is not between TLER-broken
and non-TLER-broken.  I think the TLER fans are taking advantage of
people's binary debating bias to imply that TLER is the ``works OK''
case and non-TLER is ``broken: dont u see it's 5x slower.''  There are
three cases to compare for any given failure mode: TLER-failed,
non-TLER-failed, and working.  The proper comparison is therefore
between a successful read (7ms) and an unsuccessful read (7000ms * 
cargo-cult retries put into various parts of the stack to work around
some scar someone has on their knee from some weird thing an FC switch
once did in 1999).
  
If you give a drive enough retries on a sector giving a read error, 
sometimes it can get the data back.  I once had a project with an 80GB 
Maxtor IDE drive that I needed to get all the files off of.  One file (a 
ZIP archive) was sitting over a sector with a read error.  I found that 
I could get what appeared to be partial data from the sector using 
Ontrack EasyRecovery, but the data read back from the 512 byte sector 
was slightly different each time.  I manually repeated this a few times 
and got it down to about a few bytes out of the 512 that were different 
on each re-read attempt.  Looking at those further I figured it was 
actually only a few bits of each of those bytes that were different each 
time, and I could narrow that down as well by looking at the frequency 
of the results of each read.  I knew the ZIP file had a CRC32 code that 
would match the correct byte sequence, and figured I could write up a 
brute force recovery for the remaining bytes.


I didn't end up writing the code to do that because I found something 
else: GNU ddrescue.  That can image a drive including as many automatic 
retries as you like, including infinite.  I didn't need the drive right 
away, so I started up ddrescue and let it go after the drive over a 
whole weekend.  There was only one sector on the whole drive that 
ddrescue was working to recover...the one with the file on it.  About 
two days later it finished reading, and when I mounted the drive image, 
I was then able to open up the ZIP file.  The CRC passed and I had 
confirmation that the drive had finally after days of reread attempts 
gotten that last sector.
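
For reference, the sort of invocation I mean (device and file names here 
are just placeholders; this particular recovery was done under Linux):

ddrescue -n /dev/sdb maxtor.img maxtor.log      # first pass: grab everything that reads cleanly
ddrescue -r -1 /dev/sdb maxtor.img maxtor.log   # then retry the bad areas indefinitely (-r -1)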


It was really slow, but I had nothing to lose, and just wanted to see 
what would happen.  I've tried it since on other bad sectors with 
varying results.  Sometimes a couple hundred or thousand retries will 
get a lucky break and recover the sector.  Sometimes not.




The unsuccessful read is thousands of times slower than normal
performance.  It doesn't make your array seem 5x slower during the
fail like the false TLER vs non-TLER comparison makes it seem.  It
makes your array seem entirely frozen.  The actual speed doesn't
matter: it's FROZEN.  Having TLER does not make FROZEN any faster than
FROZEN.
  

I agree.


The story here sounds great, so I can see why it spreads so well:
``during drive failures, the array drags performance a little, maybe
5x, until you locate teh drive and replace it.  However, if you have
used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced!
Unfortunately +1 magical drives are only appropriate for ENTERPRISE
use while at home we use non-magic drives, but you get what you pay
for.''  That all sounds fair, reasonable, and like good fun gameplay.
Unfortunately ZFS isn't a video game: it just fucking freezes.

bh> The difference is that a fast fail with ZFS relies on ZFS to
bh> fix the problem rather than degrading the array.

ok but the decision of ``degrading the array'' means ``not sending
commands to the slowly-failing drive any more''.

which is actually the correct decision, the wrong course being to
continue sending commands there and ``patiently waiting'' for them to
fail instead of re-issuing them to redundant drives, even when waiting
thousands of standard deviations outside the mean request time.  TLER
or not, a failing drive will poison the array by making reads
thousands of times slower.
  
I agree.  This is the behavior all RAID type devices should have whether 
hardware or Linux RAID or ZFS.  If a drive is slow to respond, stop 
sending it read commands if there is enough redundancy remaining to 
compute the data.  ZFS should have no problem with this even though I 
understand that it needs t

Re: [zfs-discuss] RAID Z stripes

2010-08-10 Thread Haudy Kazemi

Ian Collins wrote:

On 08/11/10 05:16 AM, Terry Hull wrote:
So do I understand correctly that really the "Right" thing to do is to build
a pool not only with a consistent stripe width, but also to build it with
drives of only one size?   It also sounds like from a practical point of
view that building the pool full-sized is the best policy so that the data
can be spread relatively uniformly across all the drives from the very
beginning.  In my case, I think what I will do is to start with the 16
drives in a single pool and when I need more space, I'll create a new pool
and manually move some of the existing data to the new pool to spread
the IO load.

   

That is what I have done when Thumpers fill up!

The other issue here seems to be RAIDZ2 vs RAIDZ3.  I assume there is not a
significant performance difference between the two for most loads, but
rather I choose between them based on how badly I want the array to stay
intact.

   
The real issue is how long large capacity drives take to resilver and 
is the risk of losing a second drive during that window high enough 
to cause concern. In a lot of situations with 2TB drives, it is.




Of course your redundancy requirements may be such that you are okay 
with taking that risk because you have a complete backup elsewhere that 
you can restore from in case a second drive is lost during that recovery 
window.  Evaluate your requirements and see if it makes more sense to 
dedicate more drives to make a RAIDZ3 vs RAIDZ2 (or by the same argument 
RAIDZ2 vs RAIDZ1), than it does to dedicate those additional drives to 
backup. 

Remember that redundancy provided by RAID is not a replacement for 
backup.  (I get the feeling that a little bit of redundancy (RAIDZ1 or 
RAIDZ2) with a little bit of backup (stripe or RAIDZ1) is the best 
compromise overall, with forces such as these pushing things one way or the 
other: outage tolerance, time to recover from backup, and the amount of 
pain induced by a complete system failure in case a non-backed up RAIDZ3 
falls apart.)  Running a RAIDZ3 with no backup seems like one is saying 
"I really really want to protect against individual drive failures, but 
if lightning strikes or the water main upstairs breaks I'll give up."  
Running a main pool with a non-redundant stripe and two backups says "I 
can tolerate losing my day-to-day changes to the pool since the last 
backup was run, but I really don't want to lose the bulk of my stuff 
which the backup protects."  Running a main pool with a little bit of 
redundancy (e.g. RAIDZ1) and a backup with redundancy says "I don't like 
downtime nor data loss of my day-to-day changes.  I'm going to guard 
against single drive failures which is the most likely case, but resort 
to my more time-intensive to recover backups if I hit a bad streak and 
get two+ simultaneous drive failures during a long resilver.  My backups 
have redundancy as well because if I need to fall back upon them, then 
it is because I really really need them."


E.g. with 24 x 1TB drives, some options include (a quick way to check the 
usable-space arithmetic is sketched after this list):
3 x 3x1TB RAIDZ1, 6TB usable for data, 9 drives + 2x (separate pools, 
not striped) 7x1TB RAIDZ1 6TB usable for backup, 14 drives + 1 hot 
spare.  Low strength main pool.  Two complete low strength backups.  Hot 
spare.
3 x 4x1TB RAIDZ2, 6TB usable for data, 12 drives + 2x (separate pools, 
not striped) 6x1TB stripe (no redundancy) 6TB usable for backup, 12 
drives.  Strong main pool.  Two complete but fragile backups.


3 x 4x1TB RAIDZ1, 9TB usable for data, 12 drives + 1x 12x1TB RAIDZ3 9TB 
usable for backup, 12 drives.  Low strength main pool.  Strong backup.
3 x 5x1TB RAIDZ2, 9TB usable for data, 15 drives + 1x 9x1TB stripe (no 
redundancy) 9TB usable for backup, 9 drives.  Strong main pool.  Fragile 
backup.
3 x 6x1TB RAIDZ3, 9TB usable for data, 18 drives + 1x 6x1TB stripe (no 
redundancy) 6TB usable for backup, 6 drives.  Very strong main pool.  
Fragile, possibly incomplete backup (depending on compression and 
whether part of the directory structure is not backed up)


4 x 3x1TB RAIDZ1, 8TB usable for data, 12 drives + 1x 12x1TB RAIDZ3 9TB 
usable for backup, 12 drives + 0 hot spares
4 x 3x1TB RAIDZ1, 8TB usable for data, 12 drives + 1x 10x1TB RAIDZ2 8TB 
usable for backup, 10 drives + 2 hot spares.  Low strength main pool.  
Strong backup.  Hot spares.
4 x 4x1TB RAIDZ1, 12TB usable for data, 16 drives + 1x 8x1TB stripe (no 
redundancy) 8TB usable for backup, 8 drives + 0 hot spares.  Low 
strength main pool.  Fragile, possibly incomplete backup.
4 x 4x1TB RAIDZ2, 8TB usable for data, 16 drives + 1x 8x1TB stripe (no 
redundancy) 8TB usable for backup, 8 drives + 0 hot spares.  Strong main 
pool.  Fragile backup.
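
For anyone double-checking those figures, a minimal sketch in plain shell 
(the function name and arguments are mine):

# usable TB = vdevs * (disks per vdev - parity disks) * drive size in TB
usable() { echo "$(( $1 * ($2 - $3) * $4 )) TB"; }
usable 3 4 2 1    # 3 x 4-disk RAIDZ2 of 1TB drives -> 6 TB
usable 4 3 1 1    # 4 x 3-disk RAIDZ1 of 1TB drives -> 8 TB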




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Haudy Kazemi

Peter Taps wrote:

Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk 
fails.

Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it 
simple let's not consider block sizes.

Let's say I send a write value "abcdef" to the zpool.

As the data gets striped, we will have 2 characters per disk.

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info 
may tell me that something is bad but I don't see how my data will get recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter
  


It's done via math and numbers.  :)  In a computer, everything is 
numbers, stored in base 2 (binary)...there are no letters or other 
symbols.  Your sample value of 'abcdef' will be represented as a 
sequence of numbers, probably using the ASCII equivalent numbers, which 
are in turn represented as a binary sequence.


A simplified view of how you can protect multiple independent pieces of 
information with one piece of parity is as follows.
(Note: this simplified view is not exactly how RAID5 or RAIDZ work, as 
they actually make use of XOR at a bitwise level).


Consider an equation with variables (unrelated to your sample value) A, 
B, and P, where A + B = P.  P is the parity value.
A and B are numbers representing your data; they were indirectly chosen 
by you when you created your data.  P is the generated parity value.


If A=97, and B=98, then P=97+98=195.

Each of the three variables is stored on a different disk.  If any one 
variable is lost (the disk failed), the missing variable can be 
recalculated by rearranging the formula and using the known values.


Assuming 'A' was lost, then A=P-B
P-B=195-98
195-98=97
A=97.  Data recovered.

In this simplified example, one piece of parity data P is generated for 
every pair of A and B values that are written.  Special cases handle 
things when only one value needs to be written (zero padding).  For more 
than 3 disks, the formula can expand to variations of A+B+C+D+E+F=P 
where P is the parity.  Additional levels of parity require using more 
complex techniques to generate the needed parity values.
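
A minimal sketch of the same recovery using bitwise XOR, which is what 
RAID5/RAIDZ actually use per bit (shell arithmetic; 97 and 98 are just the 
ASCII codes for 'a' and 'b'):

A=97; B=98              # the two data blocks
P=$(( A ^ B ))          # parity block written to the third disk (here: 3)
A_rebuilt=$(( P ^ B ))  # the disk holding A fails: rebuild it from the survivors
echo "P=$P  rebuilt A=$A_rebuilt"   # prints P=3  rebuilt A=97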


There are lots of other explanations online that might help you out as 
well: http://www.google.com/#hl=en&q=how+raid+works


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS pool and filesystem version list, OpenSolaris builds list

2010-08-15 Thread Haudy Kazemi

Hello,

This is a consolidated list of ZFS pool and filesystem versions, along 
with the builds and systems they are found in. It is based on multiple 
online sources. Some of you may find it useful in figuring out where 
things are at across the spectrum of systems supporting ZFS including 
FreeBSD and FUSE. At the end of this message there is a list of the 
builds OpenSolaris releases and some OpenSolaris derivatives are based 
on. The list is sort-of but not strictly comma delimited, and of course 
may contain errata.
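
To see where a particular system falls on this list, these commands report 
the versions it supports and the versions its pools and filesystems are 
currently at ('tank' below is a placeholder pool name):

zpool upgrade -v        # every pool version this build supports, with notes
zfs upgrade -v          # every filesystem (ZPL) version this build supports
zpool get version tank  # current on-disk version of the pool 'tank'
zfs get version tank    # current filesystem version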


-hk


Solaris Nevada xx = snv_xx = onnv_xx ~= testing builds for Solaris 11
SXCE = Solaris Express Community Edition

ZFS Pool Version, Where found (multiple), Notes about this version
1, Nevada/SXCE 36, Solaris 10 6/06, Initial ZFS on-disk format 
integrated on 10/31/05. During the next six months of internal use, 
there were a few on-disk format changes that did not result in a version 
number change, but resulted in a flag day since earlier versions could 
not read the newer changes. For '6389368 fat zap should use 16k blocks 
(with backwards compatibility)' and '6390677 version number checking 
makes upgrades challenging'
2, Nevada/SXCE 38, Solaris 10 11/06 (build 9), Ditto blocks (replicated 
metadata) for '6410698 ZFS metadata needs to be more highly replicated 
(ditto blocks)'
3, Nevada/SXCE 42, Solaris 10 11/06 (build 3), Hot spares and double 
parity RAID-Z for '6405966 Hot Spare support in ZFS' and '6417978 double 
parity RAID-Z a.k.a. RAID6' and '6288488 du reports misleading size on 
RAID-Z'
4, Nevada/SXCE 62, Solaris 10 8/07, zpool history for '6529406 zpool 
history needs to bump the on-disk version' and '6343741 want to store a 
command history on disk'
5, Nevada/SXCE 62, Solaris 10 10/08, gzip compression algorithm for 
'6536606 gzip compression for ZFS'
6, Nevada/SXCE 62, Solaris 10 10/08, FreeBSD 7.0, 7.1, 7.2, bootfs pool 
property for '4929890 ZFS boot support for the x86 platform' and 
'6479807 pools need properties'
7, Nevada/SXCE 68, Solaris 10 10/08, Separate intent log devices for 
'6339640 Make ZIL use NVRAM when available'
8, Nevada/SXCE 69, Solaris 10 10/08, Delegated administration for 
'6349470 investigate non-root restore/backup'
9, Nevada/SXCE 77, Solaris 10 10/08, refquota and refreservation 
properties for '6431277 want filesystem-only quotas' and '6483677 need 
immediate reservation' and '6617183 CIFS Service - PSARC 2006/715'
10, Nevada/SXCE 78, OpenSolaris 2008.05, Solaris 10 5/09 (Solaris 10 
10/08 supports ZFS version 10 except for cache devices), Cache devices 
for '6536054 second tier ("external") ARC'
11, Nevada/SXCE 94, OpenSolaris 2008.11, Solaris 10 10/09, Improved 
scrub/resilver performance for '6343667 scrub/resilver has to start over 
when a snapshot is taken'
12, Nevada/SXCE 96, OpenSolaris 2008.11, Solaris 10 10/09, added 
Snapshot properties for '6701797 want user properties on snapshot'
13, Nevada/SXCE 98, OpenSolaris 2008.11, Solaris 10 10/09, FreeBSD 7.3+, 
FreeBSD 8.0-RELEASE, Linux ZFS-FUSE 0.5.0, added usedby properties for 
'6730799 want user properties on snapshots' and 'PSARC/2008/518 ZFS 
space accounting enhancements'
14, Nevada/SXCE 103, OpenSolaris 2009.06, Solaris 10 10/09, FreeBSD 
8-STABLE, 8.1-RELEASE, 9-CURRENT, added passthrough-x aclinherit 
property support for '6765166 Need to provide mechanism to optionally 
inherit ACE_EXECUTE' and 'PSARC 2008/659 New ZFS "passthrough-x" ACL 
inheritance rules'
15, Nevada/SXCE 114, added quota property support for '6501037 want 
user/group quotas on ZFS' and 'PSARC 2009/204 ZFS user/group quotas & 
space accounting'
16, Nevada/SXCE 116, Linux ZFS-FUSE 0.6.0, added stmf property support 
for '6736004 zvols need an additional property for comstar support'
17, Nevada/SXCE 120, added triple-parity RAID-Z for '6854612 
triple-parity RAID-Z'
18, Nevada/SXCE 121, Linux zfs-0.4.9, added ZFS snapshot holds for 
'6803121 want user-settable refcounts on snapshots'
19, Nevada/SXCE 125, added ZFS log device removal option for '6574286 
removing a slog doesn't work'
20, Nevada/SXCE 128, added zle compression to support dedupe in version 
21 for 'PSARC/2009/571 ZFS Deduplication Properties'
21, Nevada/SXCE 128, added deduplication properties for 'PSARC/2009/571 
ZFS Deduplication Properties'
22, Nevada/SXCE 128a, Nexenta Core Platform Beta 2, Beta 3, added zfs 
receive properties for 'PSARC/2009/510 ZFS Received Properties'
23, Nevada 135, Linux ZFS-FUSE 0.6.9, added slim ZIL support for 
'6595532 ZIL is too talkative'
24, Nevada 137, added support for system attributes for '6716117 ZFS 
needs native system attribute infrastructure' and '6516171 zpl symlinks 
should have their own object type'

25, Nevada ??, Nexenta Core Platform RC1
26, Nevada 141, Linux zfs-0.5.0


ZFS Pool Version, OpenSolaris, Solaris 10, Description
1 snv_36 Solaris 10 6/06 Initial ZFS version
2 snv_38 Solaris 10 11/06 Ditto blocks (replicated metadata)
3 snv_42 Solaris 10 11/06 Hot spares and double parity RAID-Z
4 snv_

Re: [zfs-discuss] ZFS diaspora (was Opensolaris is apparently dead)

2010-08-15 Thread Haudy Kazemi

For the ZFS diaspora:

1.) For the immediate and near term future (say 1 year), what makes a 
better choice for a new install of a ZFS-class filesystem? Would it be 
FreeBSD 8 with its older ZFS version (pool version 14), or NexentaCore 
with newer ZFS (pool version 25(?) ), NexentaStor, or something else?  
OpenSolaris 2009.06, Solaris 10 10/09, FreeBSD 8-STABLE and 8.1-RELEASE 
all use pool version 14.  Linux ZFS-FUSE 0.6.9 is at pool version 23, 
and Linux zfs-0.5.0 is at pool version 26.


Are there any other ZFS or ZFS-class filesystems on a supported 
distribution that are worthy of consideration for this timeframe?



2.) IllumOS appears to be the likely heir to what was known as 
OpenSolaris.  They have their own mailing lists at 
http://lists.illumos.org/m/listinfo .  Interested community members 
might like to sign up there in case there is a sudden unavailability of 
opensolaris.org and its forums and lists.  Nexenta is sponsoring 
IllumOS.  Nexenta also appears somewhat insulated from the demise of 
OpenSolaris, and is a refuge for several former Sun engineers who were 
active on OpenSolaris.  Genunix.org and the Phoronix.com forums are 
other places to watch.



Other comments inline:


Russ Price wrote:
My guess is that the theoretical Solaris Express 11 will be crippled 
by any or all of: missing features, artificial limits on 
functionality, or a restrictive license. I consider the latter most 
likely, much like the OTN downloads of Oracle DB, where you can 
download and run it for development purposes, but don't even THINK of 
using it as a production server for your home or small business. Of 
course, an Oracle DB is overkill for such a purpose anyway, but that's 
a different kettle of fish.


For me, Solaris had zero mindshare since its beginning, on account of 
being prohibitively expensive. When OpenSolaris came out, I basically 
ignored it once I found out that it was not completely open source, 
since I figured that there was too great a risk of a train wreck like 
we have now. Then, I decided this winter to give ZFS a spin, decided I 
liked it, and built a home server around it - and within weeks Oracle 
took over, tore up the tracks without telling anybody, and made the 
train wreck I feared into a reality. I should have listened to my own 
advice.


As much as I'd like to be proven wrong, I don't expect SX11 to be 
useful for my purposes, so my home file server options are:


1. Nexenta Core. It's maintained, and (somewhat) more up-to-date than 
the late OpenSolaris. As I've been running Linux since the days when a 
486 was a cutting-edge system, I don't mind having a GNU userland. Of 
course, now that Oracle has slammed the door, it'll be difficult for 
it to move forward - which leads to:
1a. NexentaStor Community Edition may also be suitable for home file 
server class uses, depending on your actual storage needs.  It currently 
has a 12 TB limit, measured in actual used capacity.

http://support.nexenta.com/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=69&nav=0,15


2. IllumOS. In 20/20 hindsight, a project like this should have begun 
as soon as OpenSolaris first came out the door, but better late than 
never. In the short term, it's not yet an option, but in the long 
term, it may be the best (or only) hope. At the very least, I won't be 
able to use it until an open mpt driver is in place.


3. Just stick with b134. Actually, I've managed to compile my way up 
to b142, but I'm having trouble getting beyond it - my attempts to 
install later versions just result in new boot environments with the 
old kernel, even with the latest pkg-gate code in place. Still, even 
if I get the latest code to install, it's not viable for the long term 
unless I'm willing to live with stasis.


4. FreeBSD. I could live with it if I had to, but I'm not fond of its 
packaging system; the last time I tried it I couldn't get the package 
tools to pull a quick binary update. Even IPS works better. I could go 
to the ports tree instead, but if I wanted to spend my time 
recompiling everything, I'd run Gentoo instead.


5. Linux/FUSE. It works, but it's slow.
5a. Compile-it-yourself ZFS kernel module for Linux. This would be a 
hassle (though DKMS would make it less of an issue), but usable - 
except that the current module only supports zvols, so it's not ready 
yet, unless I wanted to run ext3-on-zvol. Neither of these solutions 
are practical for booting from ZFS.


6. Abandon ZFS completely and go back to LVM/MD-RAID. I ran it for 
years before switching to ZFS, and it works - but it's a bitter pill 
to swallow after drinking the ZFS Kool-Aid. 


7.) Linux/BTRFS.  Still green, but moving quickly.  It will have crossed 
a minimum usability and stability threshold when Ubuntu or Fedora is 
willing to support it as default.  Might happen with Ubuntu 11.04, 
although in mid-May there was talk that 10.10 had a slight chance as 
well (but that seems unlikely now).


8.) EON NAS or other OpenS

Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Haudy Kazemi

David Dyer-Bennet wrote:

On Sun, August 15, 2010 20:44, Peter Jeremy wrote:

  

Irrespective of the above, there is nothing requiring Oracle to release
any future btrfs or ZFS improvements (or even bugfixes).  They can't
retrospectively change the license on already released code but they
can put a different (non-OSS) license on any new code.



That's true.

However, if Oracle makes a binary release of BTRFS-derived code, they must
release the source as well; BTRFS is under the GPL.

So, if they're going to use it in any way as a product, they have to
release the source.  If they want to use it just internally they can do
anything they want, of course.
  


Technically Oracle could release a non-GPL version of btrfs, if they 
removed (and presumably re-wrote) all the non-Oracle commits to the 
source.  An author is allowed to release programs under multiple 
licenses simultaneously, so if Oracle only uses the Oracle developed 
btrfs code, they could re-release as binary only.  Sorting this out and 
re-writing the code written by others is probably more work than it is 
worth for Oracle so they probably won't do it.  Oracle wouldn't gain any 
friends doing this and would expose themselves to a lot of scrutiny as a 
lot of people watch for GPL violators (this action would be a big yellow 
flag to the other btrfs contributors to look for GPL violations).



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-17 Thread Haudy Kazemi

BM wrote:

On Tue, Aug 17, 2010 at 5:11 AM, Andrej Podzimek  wrote:
  

I did not say there is something wrong about published reports. I often read
them. (Who doesn't?) However, there are no trustworthy reports on this topic
yet, since Btrfs is unfinished. Let's see some examples:

(1) http://www.phoronix.com/scan.php?page=article&item=zfs_ext4_btrfs&num=1



My little few yen in this massacre: Phoronix usually compares apples
with oranges and pigs with candies. So be careful.

  

Disclaimer: I use Reiser4



A "Killer FS"โ„ข. :-)

  


ZFS is the "last word in file systems".

Ben Rockwood's Cuddletech says "Cuddletech: Use Unix or die".
http://www.cuddletech.com/

Both sound pretty final.  Might even be religious OS (operating systems) 
or FS war propaganda...


:)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-18 Thread Haudy Kazemi

Ross Walker wrote:

On Aug 18, 2010, at 10:43 AM, Bob Friesenhahn  
wrote:

  

On Wed, 18 Aug 2010, Joerg Schilling wrote:


Linus is right with his primary decision, but this also applies for static
linking. See Lawrence Rosen for more information; the GPL does not distinguish
between static and dynamic linking.
  

GPLv2 does not address linking at all and only makes vague references to the "program".  There is 
no insinuation that the program needs to occupy a single address space or mention of address spaces at all. 
The "program" could potentially be a composition of multiple cooperating executables (e.g. like 
GCC) or multiple modules.  As you say, everything depends on the definition of a "derived work".

If a shell script may be dependent on GNU 'cat', does that make the shell script a 
"derived work"?  Note that GNU 'cat' could be replaced with some other 'cat' 
since 'cat' has a well defined interface.  A very similar situation exists for loadable 
modules which have well defined interfaces (like 'cat').  Based on the argument used for 
'cat', the mere injection of a loadable module into an execution environment which 
includes GPL components should not require that module to be distributable under GPL.  
The module only needs to be distributable under GPL if it was developed in such a way 
that it specifically depends on GPL components.



This is how I see it as well.

The big problem is not the insmod'ing of the blob but how it is distributed.

As far as I know this can be circumvented by not including it in the main 
distribution but through a separate repo to be installed afterwards, ala Debian 
non-free.

-Ross
  


Various distros do the same thing with patent/license encumbered and 
binary-only pieces like some device drivers, applications, and 
multimedia codecs and playback components.  If a user wants that piece 
they click 'yes I still want it'.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-31 Thread Haudy Kazemi

Christopher George wrote:
What is a "NVRAM" based SSD? 



It is simply an SSD (Solid State Drive) which does not use Flash, 
but does use power protected (non-volatile) DRAM, as the primary 
storage media.


http://en.wikipedia.org/wiki/Solid-state_drive

I consider the DDRdrive X1 to be a NVRAM based SSD even 
though we delineate the storage media used depending on host 
power condition.  The X1 exclusively uses DRAM for all IO 
processing (host is on) and then Flash for permanent non-volatility 
(host is off).
  


NVRAM = non-volatile random access memory.  It is a general category.
EEPROM = electrically-erasable programmable read-only memory.  It is a 
specific type of NVRAM.
Flash memory = memory used in flash devices, commonly NOR or NAND 
based.  It is a specific type of EEPROM, which in turn is a specific 
type of NVRAM.


http://en.wikipedia.org/wiki/Non-volatile_random_access_memory
http://en.wikipedia.org/wiki/EEPROM
http://en.wikipedia.org/wiki/Flash_memory

He means a DRAM based SSD with NVRAM (flash) backup vs. SSDs that use 
NVRAM (flash) directly.  This class of SSD may use DDR DIMMs or may be 
integrated.  Almost all of these devices that retain their data upon 
power loss are technically NVRAM based.  (Exception could be a hard 
drive based device that uses a DRAM cache equal to its hard drive 
storage capacity.)  It is effectively what you would get if you had a 
regular flash based SSD with an internal RAM cache equal in size to the 
nonvolatile storage plus enough energy storage to write out the whole 
cache upon power loss.


I doubt there would be any additional performance beyond what you could 
see from a RAMDISK carved from main memory (actually there would 
probably be theoretical lower performance because of lower bus 
bandwidths).  It does effectively solve the problems posed by 
motherboard physical RAM limits and of an unexpected power loss due to 
failed power supplies or failed UPSes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Haudy Kazemi

Comment at end...

Mattias Pantzare wrote:

On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey  wrote:
  

From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of
Mattias Pantzare

It
is about 1 vdev with 12 disk or  2 vdev with 6 disks. If you have 2
vdev you have to read half the data compared to 1 vdev to resilver a
disk.
  

Let's suppose you have 1T of data.  You have 12-disk raidz2.  So you have
approx 100G on each disk, and you replace one disk.  Then 11 disks will each
read 100G, and the new disk will write 100G.

Let's suppose you have 1T of data.  You have 2 vdev's that are each 6-disk
raidz1.  Then we'll estimate 500G is on each vdev, so each disk has approx
100G.  You replace a disk.  Then 5 disks will each read 100G, and 1 disk
will write 100G.

Both of the above situations resilver in equal time, unless there is a bus
bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
disks in a single raidz3 provides better redundancy than 3 vdev's each
containing a 7 disk raidz1.

In my personal experience, approx 5 disks can max out approx 1 bus.  (It
actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
on a good bus, or good disks on a crap bus, but generally speaking people
don't do that.  Generally people get a good bus for good disks, and cheap
disks for crap bus, so approx 5 disks max out approx 1 bus.)

In my personal experience, servers are generally built with a separate bus
for approx every 5-7 disk slots.  So what it really comes down to is ...

Instead of the Best Practices Guide saying "Don't put more than ___ disks
into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck
by constructing your vdev's using physical disks which are distributed
across multiple buses, as necessary per the speed of your disks and buses."



This is assuming that you have no other IO besides the scrub.

You should of course keep the number of disks in a vdev low for
general performance reasons unless you only have linear reads (as your
IOPS will be close to what only one disk can give for the whole vdev).
There is another optimization in the Best Practices Guide that says the 
number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 
(raidz2), or 3 (raidz3) and N equals 2, 4, or 8.

I.e. 2^n + P devices per vdev, where n is 1, 2, or 3 and P is the RAIDZ 
parity level (a trivial loop spelling this out follows the list below).

I.e. Optimal sizes
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev
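
A trivial loop that spells the rule out:

for p in 1 2 3; do
  for n in 2 4 8; do
    echo "RAIDZ$p: $(( n + p )) disks per vdev ($n data + $p parity)"
  done
done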


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Haudy Kazemi

Erik Trimble wrote:

 On 9/9/2010 2:15 AM, taemun wrote:
Erik: does that mean that keeping the number of data drives in a 
raidz(n) to a power of two is better? In the example you gave, you 
mentioned 14kb being written to each drive. That doesn't sound very 
efficient to me.


(when I say the above, I mean a five disk raidz or a ten disk raidz2, 
etc)


Cheers,



Well, since the size of a slab can vary (from 512 bytes to 128k), it's 
hard to say. Length (size) of the slab is likely the better 
determination. Remember each block on a hard drive is 512 bytes (for 
now).  So, it's really not any more efficient to write 16k than 14k 
(or vice versa). Both are integer multiples of 512 bytes.


IIRC, there was something about using a power-of-two number of data 
drives in a RAIDZ, but I can't remember what that was. It may just be 
a phantom memory.


Not a phantom memory...

From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241
'To eliminate the blank "round up" sectors for power-of-two blocksizes 
of 8k or larger, you should use a power-of-two plus 1 number of disks in 
your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a 
power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are 
more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 
6) and for 2k, use 3 disks (for double parity, use 4).'



These round up sectors are skipped and used as padding to simplify space 
accounting and improve performance.  I may have referred to them as zero 
padding sectors in other posts, however they're not necessarily zeroed.


See the thread titled 'raidz stripe size (not stripe width)' 
http://opensolaris.org/jive/thread.jspa?messageID=495351



This looks to be the reasoning behind the optimization in the ZFS Best 
Practices Guide that says the number of devices in a vdev should be 
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8.

I.e. 2^n + P devices per vdev, where n is 1, 2, or 3 and P is the RAIDZ parity level.

I.e. Optimal sizes
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The best practices guide recommendation of 3-9 devices per vdev appears 
based on RAIDZ1's optimal size with 3-9 devices when N=1 to 3 in 2^N + P.


Victor Latushkin in a thread titled 'odd versus even' said the same 
thing.  Adam Leventhal said this had a 'very slight space-efficiency 
benefit' in the same thread.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg05460.html

---
That said, the recommendations in the Best Practices Guide for RAIDZ2 to 
start with 5 disks and RAIDZ3 to start with 8 disks, do not match with 
the last recommendation.  What is the reasoning behind 5 and 8?  
Reliability vs space?

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8


Perhaps the Best Practices Guide should also recommend:
-the use of striped vdevs in order to bring up the IOPS number, 
particularly when using enough hard drives to meet the capacity and 
reliability requirements.

-avoiding slow consumer class drives (fast ones may be okay for some users)
-more sample array configurations for common drive chassis capacities
-consider using a RAIDZ1 main pool with RAIDZ1 backup pool rather than 
higher level RAIDZ or mirroring (touch on the value of backup vs. 
stronger RAIDZ)
-watch out for BIOS or firmware upgrades that change host protected area 
(HPA) settings on drives making them appear smaller than before


The BPG should also resolve this discrepancy:
Storage Pools section: "For production systems, use whole disks rather 
than slices for storage pools for the following reasons"
Additional Cautions for Storage Pools: "Consider planning ahead and 
reserving some space by creating a slice which is smaller than the whole 
disk instead of the whole disk."


---


Other (somewhat) related threads:


From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265
'> 1 Why is the recommendation for a raidz2 3-9 disk, what are the cons 
for having 16 in a pool compared to 8?
Reads potentially have to pull data from all data columns to reconstruct 
a filesystem block for verification.  For random read workloads, 
increasing the number of columns in the raidz does not increase the read 
iops.  So limiting the column count usually makes sense (with a cost 
tradeoff).  16 is valid, but not recommended.'




From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016
'The raidz pathological worst case is a random re

Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Haudy Kazemi

Richard Elling wrote:

On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:

  

From: Richard Elling [mailto:rich...@nexenta.com]

This operational definition of "fragmentation" comes from the single-
user,
single-tasking world (PeeCees). In that world, only one thread writes
files
from one application at one time. In those cases, there is a reasonable
expectation that a single file's blocks might be contiguous on a single
disk.
That isn't the world we live in, where have RAID, multi-user, or multi-
threaded
environments.
  

I don't know what you're saying, but I'm quite sure I disagree with it.

Regardless of multithreading, multiprocessing, it's absolutely possible to
have contiguous files, and/or file fragmentation.  That's not a
characteristic which depends on the threading model.



Possible, yes.  Probable, no.  Consider that a file system is allocating
space for multiple, concurrent file writers.
  


With appropriate write caching and write grouping/re-ordering algorithms, 
it should be possible to minimize the amount of file interleaving and 
fragmentation on write that takes place.  (Or at least optimize the amount 
of file interleaving.  Years ago, MFM hard drives had configurable sector 
interleave factors to better tune performance: with no interleaving, the 
platter had already spun past the next sector before the CPU was ready for 
it, so the drive had to wait a full extra revolution for the CPU to catch up.)




Also regardless of raid, it's possible to have contiguous or fragmented
files.  The same concept applies to multiple disks.



RAID works against the efforts to gain performance by contiguous access
because the access becomes non-contiguous.


From what I've seen, defragmentation offers its greatest benefit when 
the tiniest reads are eliminated by grouping them into larger contiguous 
reads.  Once the contiguous areas reach a certain size (somewhere in the 
few Mbytes to a few hundred Mbytes range), further defragmentation 
offers little additional benefit.  Full defragmentation is a useful goal 
when the option of using file carving based data recovery is desirable.  
Also remember that defragmentation is not limited to space used by 
files.  It can also apply to free, unused space, which should also be 
defragmented to prevent future writes from being fragmented on write.


With regard to multiuser systems and how that negates the need to 
defragment, I think that is only partially true.  As long as the files 
are defragmented enough so that each particular read request only 
requires one seek before it is time to service the next read request, 
further defragmentation may offer only marginal benefit.  On the other 
hand, if files have been fragmented down to each sector being stored 
separately on the drive, then each read request is going to take that 
much longer to be completed (or will be interrupted by another read 
request because it has taken too long).


-hk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pools inside pools

2010-09-22 Thread Haudy Kazemi

Mattias Pantzare wrote:

On Wed, Sep 22, 2010 at 20:15, Markus Kovero  wrote:
  


Such a configuration was known to cause deadlocks. Even if it works now (which I 
don't expect to be the case) it will make your data be cached twice. The CPU 
utilization will also be much higher, etc.
All in all I strongly recommend against such setup.
  
--

Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
  

Well, CPU utilization can be tuned downwards by disabling checksums in inner 
pools as checksumming is done in main pool. I'd be interested in bug id's for 
deadlock issues and everything related. Caching twice is not an issue, 
prefetching could be and it can be disabled
I don't understand what makes it difficult for zfs to handle this kind of 
setup. Main pool (testpool) should just allow any writes/reads to/from volume, 
not caring what they are, whereas anotherpool would just work as any other 
pool consisting of any other devices.
This is quite similar setup to iscsi-replicated mirror pool, where you have 
redundant pool created from iscsi volumes locally and remotely.



ZFS needs free memory for writes. If you fill your memory with dirty
data zfs has to flush that data to disk. If that disk is a virtual
disk in zfs on the same computer those writes need more memory from
the same memory pool and you have a deadlock.
If you write to a zvol on a different host (via iSCSI) those writes
use memory in a different memory pool (on the other computer). No
deadlock.
Isn't this a matter of not keeping enough free memory as a workspace?  
By free memory, I am referring to unallocated memory and also 
recoverable main memory used for shrinkable read caches (shrinkable by 
discarding cached data).  If the system keeps enough free and 
recoverable memory around for workspace, why should the deadlock case 
ever arise?  Slowness and page swapping might be expected to arise (as a 
result of a shrinking read cache and high memory pressure), but 
deadlocks too?


It sounds like deadlocks from the described scenario indicate the memory 
allocation and caching algorithms do not perform gracefully in the face 
of high memory pressure.  If the deadlocks do not occur when different 
memory pools are involved (by using a second computer), that tells me 
that memory allocation decisions are playing a role.  Additional data 
should not be accepted for writes when the system determines memory 
pressure is so high that it may not be able to flush everything to disk.


Here is one article about memory pressure (on Windows, but the issues 
apply cross-OS):

http://blogs.msdn.com/b/slavao/archive/2005/02/01/364523.aspx

(How does virtualization fit into this picture?  If both OpenSolaris 
systems are actually running inside of different virtual machines, on 
top of the same host, have we isolated them enough to allow pools inside 
pools without risk of deadlocks? )


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pools inside pools

2010-09-22 Thread Haudy Kazemi

Erik Trimble wrote:

 On 9/22/2010 11:15 AM, Markus Kovero wrote:
Such a configuration was known to cause deadlocks. Even if it works 
now (which I don't expect to be the case) it will make your data 
be cached twice. The CPU utilization will also be much higher, etc.

All in all I strongly recommend against such setup.
--
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
Well, CPU utilization can be tuned downwards by disabling checksums 
in inner pools as checksumming is done in main pool. I'd be 
interested in bug id's for deadlock issues and everything related. 
Caching twice is not an issue, prefetching could be and it can be 
disabled
I don't understand what makes it difficult for zfs to handle this 
kind of setup. Main pool (testpool) should just allow any 
writes/reads to/from volume, not caring what they are, whereas 
anotherpool would just work as any other pool consisting of any other 
devices.
This is quite similar setup to iscsi-replicated mirror pool, where 
you have redundant pool created from iscsi volumes locally and remotely.


Yours
Markus Kovero


Actually, the mechanics of local pools inside pools is significantly 
different than using remote volumes (potentially exported ZFS volumes) 
to build a local pool from.


And, no, you WOULDN'T want to turn off the "inside" pool's checksums.  
You're assuming that this would be taken care of by the outside pool, 
but that's a faulty assumption, since the only way this would happen 
would be if the pools somehow understood they were being nested, and 
thus could "bypass" much of the caching and I/O infrastructure related 
to the inner pool.


What is an example of where a checksummed outside pool would not be able 
to protect a non-checksummed inside pool?  Would an intermittent 
RAM/motherboard/CPU failure that only corrupted the inner pool's block 
before it was passed to the outer pool (and did not corrupt the outer 
pool's block) be a valid example?


If checksums are desirable in this scenario, then redundancy would also 
be needed to recover from checksum failures.




Pools understanding nesting would be a win.  Another win that might 
benefit from this pool-to-pool communication interface would be a ZFS 
client (shim? driver?) that would extend ZFS checksum protection all the 
way out across the network to the workstations accessing ZFS pools.  ZFS 
offers no protection against corruption between the CIFS/NFS server and 
the CIFS/NFS client.  (The client would need to mount the pool directly 
in the current structure).



To quote myself from May 2010:

If someone wrote a "ZFS client", it'd be possible to get over the wire 
data protection.  This would be continuous from the client computer all 
the way to the storage device.  Right now there is data protection from 
the server to the storage device.  The best protected apps are those 
running on the same server that has mounted the ZFS pool containing the 
data they need (in which case they are protected by ZFS checksums and by 
ECC RAM, if present).


A "ZFS client" would run on the computer connecting to the ZFS server, 
in order to extend ZFS's protection and detection out across the network.


In one model, the ZFS client could be a proxy for communication between 
the client and the server running ZFS.  It would extend the filesystem 
checksumming across the network, verifying checksums locally as data was 
requested, and calculating checksums locally before data was sent that 
the server would re-check.  Recoverable checksum failures would be 
transparent except for performance loss, unrecoverable failures would be 
reported as unrecoverable using the standard OS unrecoverable checksum 
error message (Windows has one that it uses for bad sectors on drives 
and optical media).  The local client checksum calculations would be 
useful in detecting network failures, and local hardware instability.  
(I.e. if most/all clients start seeing checksum failures...look at the 
network; if only one client sees checksum failures, check that client's 
hardware.)


An extension to the ZFS client model would allow multi-level ZFS systems 
to better coordinate their protection and recover from more scenarios.  
By multi-level ZFS, I mean ZFS stacked on ZFS, say via iSCSI.  An 
example (I'm sure there are better ones) would be 3 servers, each with 3 
data disks.  Each disk is made into its own non-redundant pool (making 9 
non-redundant pools).  These pools are in turn shared via iSCSI.  One of 
the servers creates RAIDZ1 groups using 1 disk from each of the 3 servers.
With a means for ZFS systems to communicate, a failure of any 
non-redundant lower level device need not trigger a system halt of that 
lower system, because it will know from the higher level system that the 
device can be repaired/replaced using the higher level redundancy.
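
For concreteness, a rough sketch of how such a stack could be assembled 
with the standard tools (every pool, volume, and device name below is 
made up for illustration; the old shareiscsi property is shown as one 
way to export the LUNs, COMSTAR would work as well).  On each of the 
three storage servers, per disk:

  zpool create d1pool c1t1d0
  zfs create -V 900G d1pool/lun0      # zvol to back the iSCSI LUN
  zfs set shareiscsi=on d1pool/lun0

Then on the head node, once the initiator sees the LUNs as ordinary 
cXtYdZ devices, build the redundant layer out of one LUN from each 
server:

  zpool create bigpool raidz c2t1d0 c3t1d0 c4t1d0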

Re: [zfs-discuss] Pools inside pools

2010-09-23 Thread Haudy Kazemi

Markus Kovero wrote:
What is an example of where a checksummed outside pool would not be able 
to protect a non-checksummed inside pool?  Would an intermittent 
RAM/motherboard/CPU failure that only corrupted the inner pool's block 
before it was passed to the outer pool (and did not corrupt the outer 
pool's block) be a valid example?



  
If checksums are desirable in this scenario, then redundancy would also 
be needed to recover from checksum failures.




That is an excellent point also: what is the point of checksumming if you cannot recover from it? 
Checksum errors can tell you there is probably a problem worthy of 
attention.  They can prevent you from making things worse by stopping 
you in your tracks until whatever triggered them is resolved, or enough 
redundancy is available to overcome the errors.  This is why operating 
system kernels panic/abend/BSOD when they detect that the system state 
has been changed in an unknown way which could have unpredictable (and 
likely bad) results on further operations.


Redundancy is useful when you can't recover the data by simply asking 
for it to be re-sent or by getting it from another source.  
Communications buses and protocols will use checksums to detect 
corruption and resends/retries to recover from checksum failures.  That 
strategy doesn't work when you are talking about your end storage media.




In this kind of configuration one would benefit performance-wise from not having to 
calculate checksums again.
Checksums in outer pools effectively protect from disk issues; if hardware 
fails and data is corrupted, isn't the outer pool's redundancy going to handle it for 
the inner pool also?
The only thing that comes to mind is that IF something happens to the outer pool, the inner pool 
is no longer aware of possibly broken data, which can lead to issues.

Yours
Markus Kovero

  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does dedup work over iSCSI?

2010-10-22 Thread Haudy Kazemi

Neil Perrin wrote:

On 10/22/10 15:34, Peter Taps wrote:

Folks,

Let's say I have a volume being shared over iSCSI. The dedup has been 
turned on.


Let's say I copy the same file twice under different names at the 
initiator end. Let's say each file ends up taking 5 blocks.


For dedupe to work, each block for a file must match the 
corresponding block from the other file. Essentially, each pair of 
blocks being compared must have the same start location into the 
actual data.
  


No,  ZFS doesn't care about the file offset, just that the checksum of 
the blocks matches.




One conclusion is that one should be careful not to mess up file 
alignments when working with large files (like you might have in 
virtualization scenarios).  I.e. if you have a bunch of virtual machine 
image clones, they'll dedupe quite well initially.  However, if you then 
make seemingly minor changes inside some of those clones (like changing 
their partition offsets to do 1mb alignment), you'll lose most or all of 
the dedupe benefits.


General purpose compression tends to be less susceptible to changes in 
data offsets but also has its limits based on algorithm and dictionary 
size.  I think dedupe can be viewed as a special-case of compression 
that happens to work quite well for certain workloads when given ample 
hardware resources (compared to what would be needed to run without dedupe).
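
As a quick way to see whether a given workload actually stays aligned 
and keeps deduping, the standard commands already report the realized 
ratio (pool and dataset names here are only examples):

  zfs set dedup=on tank/vmimages   # per-dataset switch
  zpool list tank                  # the DEDUP column is the pool-wide ratio
  zdb -DD tank                     # DDT histogram: how many blocks are
                                   #  referenced once, twice, etc.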


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-22 Thread Haudy Kazemi




One thing suspicious is that we notice a slow down of one pool when the other 
is under load.  How can that be?

Ian
  
A network switch that is being maxed out?  Some switches cannot switch 
at rated line speed on all their ports all at the same time.  Their 
internal buses simply don't have the bandwidth needed for that.  Maybe 
you are running into that limit?  (I know you mentioned bypassing the 
switch completely in some other tests and not noticing any difference.)


Any other hardware in common?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Newbie ZFS Question: RAM for Dedup

2010-10-22 Thread Haudy Kazemi

Never Best wrote:

Sorry I couldn't find this anywhere yet.  For deduping it is best to have the 
lookup table in RAM, but I wasn't too sure how much RAM is suggested?

::Assuming 128KB Block Sizes, and 100% unique data:
1TB*1024*1024*1024/128 = 8388608 Blocks
::Each Block needs 8 byte pointer?
8388608*8 = 67108864 bytes
::Ram suggest per TB
67108864/1024/1024 = 64MB

So if I understand correctly we should have a min of 64MB RAM per TB for 
deduping? *hopes my math wasn't way off*, or is there significant extra 
overhead stored per block for the lookup table?  For example is there some kind 
of redundancy on the lookup table (relation to RAM space requirements) to 
counter corruption?

I read some articles and they all mention that there is significant performance 
loss if the table isn't in RAM, but none really mentioned how much RAM one 
should have per TB of deduped data.

Thanks, hope someone can confirm *or give me the real numbers* for me.  I know 
blocksize is variable; I'm most interessted in the default zfs setup right now.
  
There were several detailed discussions about this over the past 6 
months that should be in the archives.  I believe most of the info came 
from Richard Elling.
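
For what it's worth, a back-of-the-envelope version of the figure that 
has come up in those threads, assuming roughly 320 bytes of in-core DDT 
footprint per unique 128KB block rather than an 8-byte pointer (treat 
that per-entry size as an assumption to check against the archives, not 
a spec):

  # 1 TB of unique data in 128KB records, ~320 bytes of DDT per entry (ksh/bash arithmetic)
  echo $(( (1024 * 1024 * 1024 * 1024 / (128 * 1024)) * 320 / (1024 * 1024) )) MB

prints "2560 MB", i.e. roughly 2.5 GB of RAM (or L2ARC) per TB of 
unique data rather than 64MB.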

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure -> pool loss ?

2010-10-22 Thread Haudy Kazemi

Bob Friesenhahn wrote:

On Tue, 19 Oct 2010, Cindy Swearingen wrote:


unless you use copies=2 or 3, in which case your data is still safe
for those datasets that have this option set.


This advice is a little too optimistic. Increasing the copies property
value on datasets might help in some failure scenarios, but probably not
in more catastrophic failures, such as multiple device or hardware
failures.


It is 100% too optimistic.  The copies option only duplicates the user 
data.  While zfs already duplicates the metadata (regardless of copies 
setting), it is not designed to function if a vdev fails.


Bob


Some future filesystem (not zfs as currently implemented) could be 
designed to handle certain vdev failures where multiple vdevs were used 
without redundancy at the vdev level.  In this scenario, the redundant 
metadata and user data with copies=2+ would still be accessible by 
virtue of it having been spread across the vdevs, with at least one copy 
surviving.  Expanding upon this design would allow raw space to be 
added, with redundancy being set by a 'copies' parameter.


I understand the copies parameter to currently be designed and intended 
as an extra assurance against failures that affect single blocks but not 
whole devices.  I.e. run ZFS on a laptop with a single hard drive, and 
use copies=2 to protect against bad sectors but not complete drive 
failures.  I have not tested this, however I imagine that performance is 
the reason to use copies=2 instead of partitioning/slicing the drive 
into two halves and mirroring the two halves back together.  I also 
recall seeing something about the copies parameter attempting to spread 
the copies across different devices, as much as possible.
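
A minimal sketch of the single-drive laptop case described above, using 
only standard commands (the dataset name is an example):

  zfs create rpool/export/important
  zfs set copies=2 rpool/export/important   # each user data block written twice
  zfs get copies,used,referenced rpool/export/important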

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-22 Thread Haudy Kazemi

Tim Cook wrote:



On Fri, Oct 22, 2010 at 10:40 PM, Haudy Kazemi <kaze0...@umn.edu> wrote:




One thing suspicious is that we notice a slow down of one pool
when the other is under load.  How can that be?

Ian
 


A network switch that is being maxed out?  Some switches cannot
switch at rated line speed on all their ports all at the same
time.  Their internal buses simply don't have the bandwidth needed
for that.  Maybe you are running into that limit?  (I know you
mentioned bypassing the switch completely in some other tests and
not noticing any difference.)

Any other hardware in common?




There's almost 0 chance a switch is being overrun by a single gigE 
connection.  The worst switch I've seen is roughly 8:1 oversubscribed. 
 You'd have to be maxing out many, many ports for a switch to be a 
problem.


Likely you don't have enough ram or CPU in the box.

--Tim



I agree, but also trying not to assume anything.  Looking back, Ian's 
first email said '10GbE on a dedicated switch'.  I don't think the 
switch model was ever identified...perhaps it is a 1 GbE switch with a 
few 10 GbE ports?  (Grasping at straws.)



What happens when Windows is the iSCSI initiator connecting to an iSCSI 
target on ZFS?  If that is also slow, the issue is likely not in Windows 
or in Linux.


Do CIFS shares (connected to from Linux and from Windows) show the same 
performance problems as iSCSI and NFS?  If yes, this would suggest a 
common cause item on the ZFS side.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Jumping ship.. what of the data

2010-10-27 Thread Haudy Kazemi



Finding PCIe x1 cards with more than 2 SATA ports is difficult so you
might want to make sure that either your chosen motherboard has lots
of PCIe slots or has some wider slots.  If you plan on using on-board
video and re-using the x16 slot for something else, you should verify
that the BIOS will let you do that - I've got several (admittedly old)
systems where the x16 slot must either be empty or have a video card
to work.
  


While it is not commonly done, it is possible to put faster PCIe 
cards in slower PCIe slots with performance being limited to the lowest 
common denominator.  E.g. a 16x card in a 1x slot.  Doing so will 
probably require the use of a Dremel tool or soldering iron to cut or 
melt off the back of a PCIe slot.  An 8 port SATA PCIe 4x card could be 
used in a 1x slot using this technique (performance limited to 1x).


Examples, some with pictures:

http://forums.whirlpool.net.au/archive/1380104
http://forums.overclockers.com.au/showthread.php?t=790660
http://www.tomshardware.com/forum/249291-30-card
http://forums.pcbsd.org/showthread.php?t=7636
http://www.geekzone.co.nz/forums.asp?forumid=83&topicid=26706


If you are concerned about reliability, you might like to look at
motherboard and CPU combinations that support ECC RAM.  I believe all
Asus AMD boards now support ECC and some Gigabyte boards do (though
identifying them can be tricky).

  
I suggest looking at the ASUS AMD CSM (corporate stability model) 
motherboards.  These models support ECC, don't change as often, and are 
supported longer, which are good characteristics for a build it yourself 
server/workstation.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-11-01 Thread Haudy Kazemi

Ross Walker wrote:

On Nov 1, 2010, at 5:09 PM, Ian D  wrote:

  

Maybe you are experiencing this:
http://opensolaris.org/jive/thread.jspa?threadID=11942
  

It does look like this... Is this really the expected behaviour?  That's just 
unacceptable.  It is so bad it sometimes drops connections and fails copies and 
SQL queries...



Then set the zfs_write_limit_override to a reasonable value.

Depending on the speed of your ZIL and/or backing store (for async IO) you will 
need to limit the write size in such a way so TXG1 is fully committed before 
TXG2 fills.

Myself, with a RAID controller with a 512MB BBU write-back cache I set the 
write limit to 512MB which allows my setup to commit-before-fill.

It also prevents ARC from discarding good read cache data in favor of write 
cache.

Others may have a good calculation based on ARC execution plan timings, disk 
seek and sustained throughput to give an accurate figure based on one's setup, 
otherwise start with a reasonable value, say 1GB, and decrease until the pauses 
stop.

-Ross
  


If this is the root cause, it sounds like some default configuration 
parameters need to be calculated, determined and adjusted differently 
from how they are now.  It is highly preferable that the default 
parameters do not exhibit severe problems.  Defaults should offer 
stability and performance to the extent that stability is not 
compromised.  (I.e. no dedupe by default under its current state).  
(Manually tweaked parameters are a different story in that they should 
allow a knowledgeable user to get a little more performance even if that 
is more risky).
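
For reference, the tuning Ross describes would look roughly like this 
in /etc/system (zfs_write_limit_override is an unsupported knob, and 
the 512MB value below is only his example, sized to his controller 
cache, not a general recommendation):

  set zfs:zfs_write_limit_override = 0x20000000   # 0x20000000 = 512MB; reboot to apply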


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any limit on pool hierarchy?

2010-11-08 Thread Haudy Kazemi

Bryan Horstmann-Allen wrote:

+--
| On 2010-11-08 13:27:09, Peter Taps wrote:
| 
| From zfs documentation, it appears that a "vdev" can be built from more vdevs. That is, a raidz vdev can be built across a bunch of mirrored vdevs, and a mirror can be built across a few raidz vdevs.  
| 
| Is my understanding correct? Also, is there a limit on the depth of a vdev?
  
It looks like there is confusion coming from the use of the terms 
virtual device and vdev.  The documentation can be confusing in this regard.


There are two types of vdevs: root and leaf.  A pool's root vdevs are 
usually of the 'mirror' or 'raidz' type, but can also directly use the 
underlying devices if you don't want any redundancy from ZFS 
whatsoever.  Pools dynamically stripe data across all the root vdevs 
present (and not yet full) in that pool at the time the data was written.


Leaf vdevs directly use the underlying devices.  Underlying devices may 
be hard drives, solid state drives, iSCSI volumes, or even files on 
filesystems.  Root vdevs cannot directly be used as underlying devices.




You are incorrect.

The man page states:

 Virtual devices cannot be nested, so a mirror or raidz  vir-
 tual device can only contain files or disks. Mirrors of mir-
 rors (or other combinations) are not allowed.

 A pool can have any number of virtual devices at the top  of
 the  configuration  (known as "root vdevs"). Data is dynami-
 cally distributed across all top-level  devices  to  balance
 data  among  devices.  As new virtual devices are added, ZFS
 automatically places data on the newly available devices.

  


This has been touched on and discussed in some previous threads.  There 
is a way to perform nesting, but it is *not* recommended.  The trick is 
to insert another abstraction layer that hides ZFS from itself (or in 
other words convert a root vdev into an underlying device).  An example 
would be creating iSCSI targets out of a ZFS pool, and then creating a 
second ZFS pool out of those iSCSI targets.  Another example would be 
creating a ZFS pool out of files stored on another ZFS pool.  The main 
reasons that have been given for not doing this are unknown edge and 
corner cases that may lead to deadlocks, and that it creates a complex 
structure with potentially undesirable and unintended performance and 
reliability implications.  Deadlocks may occur in low resource 
conditions.  If resources (disk space and RAM) never run low, the 
deadlock scenarios may not arise.
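
Purely as an illustration of the file-backed variant (again, not 
something to rely on for data you care about), the nesting looks like 
this, with 'tank' standing in for an existing outer pool:

  mkdir /tank/vdevs
  mkfile 1g /tank/vdevs/f1 /tank/vdevs/f2 /tank/vdevs/f3   # file-backed vdevs on the outer pool
  zpool create innerpool raidz /tank/vdevs/f1 /tank/vdevs/f2 /tank/vdevs/f3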


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive randomly being removed from pool

2010-11-08 Thread Haudy Kazemi

besson3c wrote:

This has happened to me several times now, I'm confused as to why...

This one particular drive, and it's always the same drive, randomly shows up as 
being removed from the pool. I have to export and import the pool in order to 
have this disk seen again and for re-silvering to reoccur. When this last 
happened I shared some of the logs and nothing of relevance was found as a 
possible cause of this.

I'd like to try this again, and I'd love to hear if you have any ideas as to what might be causing this. Is RAM starvation a possible cause? Hardware problem? 


A hardware problem with the drive itself, or its power or data cable, or 
SATA/SAS port would be my first guess.  I suggest moving the drive to a 
different chassis position, cable, and/or port.  This will help you 
identify whether the problem follows the drive or one of the other 
pieces of hardware.  Try making one change at a time until the problem 
follows the change.
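
Before (or while) swapping parts, the error telemetry is also worth a 
look, since it often helps distinguish a dying drive from a flaky cable 
or port (standard commands; the pool name is an example):

  zpool status -x tank   # which device ZFS currently considers removed/faulted
  iostat -En             # per-device hard/soft/transport error counters
  fmdump -eV             # low-level FMA ereports (resets, timeouts, retries)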

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any limit on pool hierarchy?

2010-11-09 Thread Haudy Kazemi

Maurice Volaski wrote:

I think my initial response got mangled. Oops.

  

creating a ZFS pool out of files stored on another ZFS pool.  The main
reasons that have been given for not doing this are unknown edge and
corner cases that may lead to deadlocks, and that it creates a complex
structure with potentially undesirable and unintended performance and
reliability implications.



Computers are continually encountering unknown edge and corner cases in
the various things they do all the time. That's what we have testing for.
  


I agree.  The earlier discussions of this topic raised the issue that 
this is not a well tested area and is an unsupported configuration.  
Some of the problems that arise in nested pool configurations may also 
arise in supported pool configurations; nested pools may significantly 
aggravate the problems.  The trick is to find test cases in supported 
configurations so the problems can't simply be swept under the rug of 
"unsupported configuration".




Deadlocks may occur in low resource
conditions.  If resources (disk space and RAM) never run low, the
deadlock scenarios may not arise.



It sounds like you mean any low resource condition. Presumably, utilizing
complex pool structures like these will tax resources, but there are many
other ways to do that.
  


We have seen ZFS systems lose stability under low resource conditions.  
They don't always gracefully degrade/throttle back performance as 
resources run very low.


I see a parallel in the 64 bit vs 32 bit ZFS code...the 32 bit code has 
much tighter resource constraints put on it due to memory addressing 
limits, and we see notes in many places that the 32 bit code is not 
production ready and not recommended unless you have no other choice.  
The machines the 32 bit code is run on also tend to have tighter 
physical resource limits, which compounds the problems.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [illumos-Developer] zfs refratio property

2011-06-06 Thread Haudy Kazemi

On 6/6/2011 5:02 PM, Richard Elling wrote:

On Jun 6, 2011, at 2:54 PM, Yuri Pankov wrote:


On Mon, Jun 06, 2011 at 02:19:50PM -0700, Matthew Ahrens wrote:

I have implemented a new property for ZFS, "refratio", which is the
compression ratio for referenced space (the "compressratio" is the ratio for
used space).

Just one question - wouldn't "compressrefratio" (or something similar)
be a better name for the property (I know that it's long, but more
intuitive, in my opinion)?

I'd favor "refcompressratio"
otherwise LGTM
  -- richard


Looks useful.  I favor longer, more descriptive names.  
"refcompressratio" is my preference amongst the three proposed names.


-hk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz1 faulted with single bad disk. Requesting assistance.

2009-04-22 Thread Haudy Kazemi

Brad Hill wrote:

I've seen reports of a recent Seagate firmware update
bricking drives again.

What's the output of 'zpool import' from the LiveCD?
 It sounds like
more than 1 drive is dropping off.




r...@opensolaris:~# zpool import
  pool: tank
id: 16342816386332636568
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

tankFAULTED  corrupted data
  raidz1DEGRADED
c6t0d0  ONLINE
c6t1d0  ONLINE
c6t2d0  ONLINE
c6t3d0  UNAVAIL  cannot open
c6t4d0  ONLINE

  pool: rpool
id: 9891756864015178061
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

rpool   ONLINE
  c3d0s0ONLINE
  
1.) Here's a similar report from last summer from someone running ZFS on 
FreeBSD.  No resolution there either:

raidz vdev marked faulted with only one faulted disk
http://kerneltrap.org/index.php?q=mailarchive/freebsd-fs/2008/6/15/2132754

2.) This old thread from Dec 2007 for a different raidz1 problem, titled 
'Faulted raidz1 shows the same device twice' suggests trying these 
commands (see the link for the context they were run under):

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg13214.html

# zdb -l /dev/dsk/c18t0d0      (dumps the ZFS labels on the device)

# zpool export external         (export and re-import force a re-read
# zpool import external          of the on-disk labels and configuration)

# zpool clear external          (clear the error counters)
# zpool scrub external          (re-read and verify all data and parity)
# zpool clear external

3.) Do you have ECC RAM? Have you verified that your memory, cpu, and 
motherboard are reliable?


4.) 'Bad exchange descriptor' is mentioned very sparingly across the 
net, mostly in system error tables.  Also here: 
http://opensolaris.org/jive/thread.jspa?threadID=88486&tstart=165


5.) More raidz setup caveats, at least on MacOS: 
http://lists.macosforge.org/pipermail/zfs-discuss/2008-March/000346.html


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Haudy Kazemi




Bob Friesenhahn wrote:
On Mon, 15 Jun 2009, Bob Friesenhahn wrote:

  On Mon, 15 Jun 2009, Rich Teer wrote:

You actually have that backwards.  :-)  In most cases, compression is very
desirable.  Performance studies have shown that today's CPUs can compress
data faster than it takes for the uncompressed data to be read or written.


Do you have a reference for such an analysis based on ZFS?  I would be
interested in linear read/write performance rather than random access
synchronous access.

Perhaps you are going to make me test this for myself.


Ok, I tested this for myself on a Solaris 10 system with 4 3GHz AMD64
cores and see that we were both right.  I did an iozone run with
compression and do see a performance improvement.  I don't know what
the data iozone produces looks like, but it clearly must be quite
compressible.  Testing was done with a 64GB file:

                      KB  reclen   write  rewrite     read   reread
uncompressed:   67108864     128  359965   354854   550869   554271
lzjb:           67108864     128  851336   924881  1289059  1362625

Unfortunately, during the benchmark run with lzjb the system desktop
was essentially unusable with a misbehaving mouse and keyboard as well as
reported 55% CPU consumption.  Without the compression the system is
fully usable with very little CPU consumed.

If the system is dedicated to serving files rather than also being used
interactively, it should not matter much what the CPU usage is.  CPU
cycles can't be stored for later use.  Ultimately, it (mostly*) does
not matter if one option consumes more CPU resources than another if
those CPU resources were otherwise going to go unused.  Changes
(increases) in latencies are a consideration but probably depend more
on process scheduler choice and policies.
*Higher CPU usage will increase energy consumption, heat output, and
cooling costs...these may be important considerations in some
specialized dedicated file server applications, depending on
operational considerations.

The interactivity hit may pose a greater challenge for any other
processes/databases/virtual machines run on hardware that also serves
files.  The interactivity hit may also be evidence that the process
scheduler is not fairly or effectively sharing CPU resources amongst
the running processes.  If scheduler tweaks aren't effective, perhaps
dedicating a processor core(s) to interactive GUI stuff and the other
cores to filesystem duties would help smooth things out.  Maybe zones
could be used for that?

With a slower disk subsystem the CPU overhead would surely
be less since writing is still throttled by the disk.

It would be better to test with real data rather than iozone.

There are 4 sets of articles with links and snippets from their test
data below.ย  Follow the links for the full discussion:

First article:
http://blogs.sun.com/dap/entry/zfs_compression#comments
Hardware:
Sun Storage 7000
# The server is a quad-core 7410 with 1 JBOD (configured with mirrored
storage) and 16GB of RAM. No SSD.
# The client machine is a quad-core 7410 with 128GB of DRAM.
Summary: text data set

  Compression   Ratio    Total    Write    Read
  off           1.00x     3:30     2:08    1:22
  lzjb          1.47x     3:26     2:04    1:22
  gzip-2        2.35x     6:12     4:50    1:22
  gzip          2.52x    11:18     9:56    1:22
  gzip-9        2.52x    12:16    10:54    1:22

Summary: media data set

  Compression   Ratio    Total    Write    Read
  off           1.00x     3:29     2:07    1:22
  lzjb          1.00x     3:31     2:09    1:22
  gzip-2        1.01x     6:59     5:37    1:22
  gzip          1.01x     7:18     5:57    1:21
  gzip-9        1.01x     7:37     6:15    1:22



Second article/discussion:
http://ekschi.com/technology/2009/04/28/zfs-compression-a-win-win/
http://blogs.sun.com/observatory/entry/zfs_compression_a_win_win

Third article summary:
ZFS and MySQL/InnoDB shows that gzip is often cpu-bound on current
processors; lzjb improves performance.
http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/
Hardware:
SunFire
X2200 M2 w/64GB of RAM and 2 x dual-core 2.6GHz Opterons
Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed
write caches
"Also note that this is writing to two DAS enclosures with 15 x 15K
SCSI disks apiece (28 spindles in a striped+mirrored configuration)
with 512MB of write cache apiece."


  

  TABLE1
  compression    size   ratio   time
  uncompressed   172M   1       0.207s
  lzjb           79M    2.18X   0.234s
  gzip-1         50M    3.44X   0.24s

Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Haudy Kazemi

David Magda wrote:

On Tue, June 16, 2009 15:32, Kyle McDonald wrote:

  

So the cache saves not only the time to access the disk but also the CPU
time to decompress. Given this, I think it could be a big win.



Unless you're in GIMP working on JPEGs, or doing some kind of MPEG video
editing--or ripping audio (MP3 / AAC / FLAC) stuff. All of which are
probably some of the largest files in most people's homedirs nowadays.

1 GB of e-mail is a lot (probably my entire personal mail collection for a
decade) and will compress well; 1 GB of audio files is nothing, and won't
compress at all.

Perhaps compressing /usr could be handy, but why bother enabling
compression if the majority (by volume) of user data won't do anything but
burn CPU?

So the correct answer on whether compression should be enabled by default
is "it depends". (IMHO :)  )
  
The performance tests I've found almost universally show LZJB as not 
being cpu-bound on recent equipment.  A few years from now GZIP may 
escape being cpu-bound as well.  As performance tests on current hardware 
show that enabling LZJB improves overall performance it would make sense 
to enable it by default.  In the future when GZIP is no longer 
cpu-bound, it might become the default (or there could be another 
algorithm).  There is a long history of previously formidable tasks 
starting out as cpu-bound but quickly progressing to an 'easily handled 
in the background' task.  Decoding MP3 and MPEG1, MPEG2 (DVD 
resolutions), softmodems (and other host signal processor devices), and 
RAID are all tasks that can easily be handled by recent equipment.
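
Enabling it and checking what it actually buys you is cheap to try on a 
per-dataset basis (dataset names below are only examples):

  zfs set compression=lzjb tank/data       # current low-cost default choice
  zfs get compression,compressratio tank/data
  zfs set compression=gzip-2 tank/archive  # heavier algorithm where write speed matters less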


Another option/idea to consider is using LZJB as the default compression 
method, and then performing a background scrub-recompress during 
otherwise idle times. Technique ideas:
1.) A performance neutral/performance enhancing technique: use any 
algorithm that is not CPU bound on your hardware, and rarely if ever has 
worse performance than the uncompressed state
2.) Adaptive technique 1: rarely used blocks could be given the 
strongest compression (using an algorithm tuned for the data type 
detected), while frequently used blocks would be compressed at 
performance-neutral or performance-improving levels.
3.) Adaptive technique 2: rarely used blocks could be given the 
strongest compression (using an algorithm tuned for the data type 
detected), while frequently used blocks would be compressed at 
performance-neutral or performance-improving levels.  As the storage 
device gets closer to its native capacity, start applying compression 
both proactively (to new data) and retroactively (to old data), 
progressively using more powerful compression techniques as the maximum 
native capacity is approached.  Compression could delay the users from 
reaching the 80-95% capacity point where system performance curves often 
have their knees (a massive performance degradation with each additional 
unit).
4.) Maximize space technique: detect the data type and use the best 
available algorithm for the block.


As a counterpoint, if drive capacities keep growing at their current 
pace it seems they ultimately risk obviating the need to give much 
thought to the compression algorithm, except to choose one that boosts 
system performance.  (I.e. in hard drives, compression may primarily be 
used to improve performance rather than gain extra storage space, as 
drive capacity has grown many times faster than drive performance.)


JPEGs often CAN be /losslessly/ compressed further by useful amounts 
(e.g. 25% space savings).  There is more on this here:

Tests:
http://www.maximumcompression.com/data/jpg.php
http://compression.ca/act/act-jpeg.html
http://www.downloadsquad.com/2008/09/11/winzip-12-supports-lossless-jpg-compression/
http://download.cnet.com/8301-2007_4-10038172-12.html
http://www.online-tech-tips.com/software-reviews/winzip-vs-7-zip-best-compression-method/

These have source code available:
http://sylvana.net/jpeg-ari/
PAQ8R http://www.cs.fit.edu/~mmahoney/compression/   (general info 
http://en.wikipedia.org/wiki/PAQ )


This one says source code is "not yet available" (implying it may become 
available):

http://www.elektronik.htw-aalen.de/packjpg/packjpg_m.htm


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Haudy Kazemi

Bob Friesenhahn wrote:

On Wed, 17 Jun 2009, Haudy Kazemi wrote:

usable with very little CPU consumed.
If the system is dedicated to serving files rather than also being 
used interactively, it should not matter much what the CPU usage is.  
CPU cycles can't be stored for later use.  Ultimately, it (mostly*) 
does not matter if


Clearly you have not heard of the software flywheel:

  http://www.simplesystems.org/users/bfriesen/software_flywheel.html
I had not heard of such a device, however from the description it 
appears to be made from virtual unobtanium :)


My line of reasoning is that unused CPU cycles are to some extent a 
wasted resource, paralleling the idea that having system RAM sitting 
empty/unused is also a waste and should be used for caching until the 
system needs that RAM for other purposes (how the ZFS cache is supposed 
to work).  This isn't a perfect parallel as CPU power consumption and 
heat output do vary by load much more than does RAM.  I'm sure someone 
could come up with a formula for the optimal CPU loading to maximize 
energy efficiency.  There has been work on this in the paper 'Dynamic Data 
Compression in Multi-hop Wireless Networks' at 
http://enl.usc.edu/~abhishek/sigmpf03-sharma.pdf .


If I understand the blog entry correctly, for text data the task took 
up to 3.5X longer to complete, and for media data, the task took about 
2.2X longer to complete with a maximum storage compression ratio of 
2.52X.


For my backup drive using lzjb compression I see a compression ratio 
of only 1.53x.


I linked to several blog posts.  It sounds like you are referring to ' 
http://blogs.sun.com/dap/entry/zfs_compression#comments '?
This blog's test results show that on their quad core platform (the Sun 7410 
has quad core 2.3 GHz AMD Opteron CPUs*):
* 
http://sunsolve.sun.com/handbook_pub/validateUser.do?target=Systems/7410/spec


for text data, LZJB compression had negligible performance benefits 
(task times were unchanged or marginally better) and less storage space 
was consumed (1.47:1).
for media data, LZJB compression had negligible performance benefits 
(task times were unchanged or marginally worse) and storage space 
consumed was unchanged (1:1).
Take away message: as currently configured, their system has nothing to 
lose from enabling LZJB.


for text data, GZIP compression at any setting, had a significant 
negative impact on write times (CPU bound), no performance impact on 
read times, and significant positive improvements in compression ratio.
for media data, GZIP compression at any setting, had a significant 
negative impact on write times (CPU bound), no performance impact on 
read times, and marginal improvements in compression ratio.
Take away message: With GZIP as their system is currently configured, 
write performance would suffer in exchange for a higher compression 
ratio.  This may be acceptable if the system fulfills a role that has a 
read heavy usage profile of compressible content.  (An archive.org 
backend would be such an example.)  This is similar to the tradeoff made 
when comparing RAID1 or RAID10 vs RAID5.


Automatic benchmarks could be used to detect and select the optimal 
compression settings for best performance, with the basic case assuming 
the system is a dedicated file server and more advanced cases accounting 
for the CPU needs of other processes run on the same platform.  Another 
way would be to ask the administrator what the usage profile for the 
machine will be and preconfigure compression settings suitable for that 
use case.


Single and dual core systems are more likely to become CPU bound from 
enabling compression than a quad core.


All systems have bottlenecks in them somewhere by virtue of design 
decisions.  One or more of these bottlenecks will be the rate limiting 
factor for any given workload, such that even if you speed up the rest 
of the system the process will still take the same amount of time to 
complete.  The LZJB compression benchmarks on the quad core above 
demonstrate that LZJB is not the rate limiter either in writes or 
reads.  The GZIP benchmarks show that it is a rate limiter, but only 
during writes.  On a more powerful platform (6x faster CPU), GZIP writes 
may no longer be the bottleneck (assuming that the network bandwidth and 
drive I/O bandwidth remain unchanged).


System component balancing also plays a role.  If the server is 
connected via a 100 Mbps CAT5e link, and all I/O activity is from client 
computers on that link, does it make any difference if the server is 
actually capable of GZIP writes at 200 Mbps, 500 Mbps, or 1500 Mbps?  If 
the network link is later upgraded to Gigabit ethernet, now only the 
system capable of GZIPing at 1500 Mbps can keep up.  The rate limiting 
factor changes as different components are upgraded.


In many systems for many workloads, hard drive I/O bandwidth is the rate 
limiting factor that has the most significant pe

Re: [zfs-discuss] recover data after zpool create

2009-06-19 Thread Haudy Kazemi

Kees Nuyt wrote:

On Fri, 19 Jun 2009 11:50:07 PDT, stephen bond
 wrote:

  

Kees,

is it possible to get at least the contents of /export/home ?

that is supposedly a separate file system.



That doesn't mean that data is in one particular spot on the
disk. The blocks of the zfilesystems can be interspersed.
  
You can try a recovery tool that supports file carving.  This technique 
looks for files based on their signatures while ignoring damaged, 
nonexistent, or unsupported partition and/or filesystem info.  Works 
best on small files, but gets worse as file sizes increase (or more 
accurately, gets worse as file fragmentation increases).  Should work 
well for files smaller than the stripe size, but possibly not at all for 
compressed files unless you are using a data recovery app that 
understands ZFS compression formats (I don't know of any myself).  
Disable or otherwise do not run scrub or any other command that may 
write to the array until you have exhausted your recovery options or no 
longer care to keep trying.


EasyRecovery supports file carving as does RecoverMyFiles, and 
TestDisk.  I'm sure there are others too.  Not all programs actually 
call it file carving.  The effectiveness of the programs may vary so it 
is worthwhile to try any demo versions.  The programs will need direct 
block level access to the drive...network shares won't work.  You can run 
the recovery software on whatever OS it needs, and based on what you are 
asking for, you don't need to seek recovery software that is explicitly 
Solaris compatible.


is there a 
way to look for files using some low level disk reading
tool. If you are old enough to remember the 80s 
there was stuff like PCTools that could read anywhere
on the disk. 



I am old enough. I was the proud owner of a 20 MByte
harddisk back then (~1983).
Disks were so much smaller, you could practically scroll
most of the contents in a few hours.
The on disk data structures are much more complicated now.
  
I recall using a 12.5 MHz 286 Amdek (Wyse) PC with a 20 MB 3600 rpm 
Miniscribe MFM drive.  A quick Google search for this item says its 
transfer rate specs were 0.625 MB/sec, which sounds about right IIRC (if 
you chose the optimal interleave when formatting).  If you had the wrong 
interleave, performance suffered, but I also recall that the drive 
made less noise.  I think I even ran that drive at a suboptimal 
interleave for a while simply because it was quieter...you could say it 
was an early, indirect form of AAM (acoustic management).


To put that drive capacity and transfer rate into comparison with a 
modern drive, you could theoretically fill the 20 mb drive in 
20/0.625=32 seconds.  A 500 GB (base 10) SATA2 drive (WD5000AAKS) has an 
average write rate of 68 MB/sec.  466*1024/68=7012 seconds to fill.  
Capacity growth is significantly outpacing read/write performance, 
which I've seen summed up as: modern drives are becoming like the tapes 
of yesteryear.


Those data recovery tools took advantage of the filesystem's design that 
it only erased the index entry (sometimes only a single character in the 
filename) in the FAT.  When NTFS came out, it took a few years for 
unerase and general purpose NTFS recovery to be possible.  This was 
actually a concern of mine and one reason I delayed using NTFS by 
default on several Windows 2000/XP systems.  I waited until good 
recovery tools were available before I committed to the new filesystem 
(in spite of it being journaled, there initially just weren't any 
recovery tools available in case things went horribly wrong, Live CDs 
were not yet available, and there weren't any read/write NTFS tools 
available for DOS or Linux).  In short, graceful degradation and the 
availability of recovery tools is important in selecting a filesystem, 
particularly when used on a desktop that may not have regular backups.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-19 Thread Haudy Kazemi



I think a better question would be: what kind of tests would be most
promising for turning some subclass of these lost pools reported on
the mailing list into an actionable bug?

my first bet would be writing tools that test for ignored sync cache
commands leading to lost writes, and apply them to the case when iSCSI
targets are rebooted but the initiator isn't.

I think in the process of writing the tool you'll immediately bump
into a defect, because you'll realize there is no equivalent of a
'hard' iSCSI mount like there is in NFS.  and there cannot be a strict
equivalent to 'hard' mounts in iSCSI, because we want zpool redundancy
to preserve availability when an iSCSI target goes away.  I think the
whole model is wrong somehow.
  
I'd surely hope that a ZFS pool with redundancy built on iSCSI targets 
could survive the loss of some targets whether due to actual failures or 
necessary upgrades to the iSCSI targets (think OS upgrades + reboots on 
the systems that are offering iSCSI devices to the network.)


My suggestion is use multi-way redundancy with iSCSI...e.g. 3 way 
mirrors or RAIDZ2...so that you can safely offline one of the iSCSI 
targets while still leaving the pool with some redundancy.  Sure there 
is an increased risk while that device is offline, but the window of 
opportunity is small for a failure of the 2nd level redundancy; and even 
then nothing is yet lost until a 3rd device has a fault.  Failures 
should also distinguish between complete failure (e.g. device no longer 
responds to commands whatsoever) and intermittent failure (e.g. a 
"sticky" patch of sectors, or the drive stops responding for a minute 
because it has a non-changeable TLER value that otherwise may cause 
trouble in a RAID configuration).  Drives have a gradation from complete 
failure to flaky to flawless...if the software running on them 
recognizes this, better decisions can be made about what to do when an 
error is encountered rather than the simplistic good/failed model that 
has been used in RAIDs for years.
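
As a concrete sketch of the "safely offline one target" case (pool and 
device names are illustrative):

  zpool offline tank c2t1d0   # take the LUN from the box being upgraded out of service
  # ...upgrade/reboot the machine serving that iSCSI target...
  zpool online tank c2t1d0    # bring it back; only the changed data is resilvered
  zpool status tank           # confirm the pool returns to full redundancy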


My preference for storage behavior is that it should not cause a system 
panic (ever).  Graceful error recovery techniques are important.  File 
system error messages should be passed up the line when possible so the 
user can figure out something is amiss with some files (even if not all) 
even though the sysadmin is not around or email notification of problems 
is not working.  If it is possible to return a CRC error to a 
network share client, that would seem to be a close match to an 
uncorrectable checksum failure.  (Windows throws these errors when it 
cannot read a CD/DVD.)


A good damage mitigation feature is to provide some mechanism to allow a 
user to ignore the checksum failure as in many user data cases partial 
recovery is preferable to no recovery.  To ensure that damaged files are 
not accidentally confused with good files, ignoring the checksum 
failures might only be allowed through a special "recovery filesystem" 
that only lists damaged files the authenticated user has access to.  
From the network client's perspective, this would be another shared 
folder/subfolder that is only present when uncorrectable, damaged files 
have been found.  ZFS would set up the appropriate links to replicate 
the directory structure of the original as needed to include the damaged 
file.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering fs's

2009-06-22 Thread Haudy Kazemi

Matt Harrison wrote:
I know this may have been discussed before but my google-fu hasn't 
turned up anything directly related.


My girlfriend had some files stored in a zfs dataset on my home 
server. She assured me that she didn't need them any more so I 
destroyed the dataset (I know I should have kept it anyway for just 
this occasion).


She's now desperate to get it back as she's realised there some 
important work stuff hidden away in there.


Now there has been data written to the other datasets, but as far as I 
know, no other dataset has been created or destroyed.


Is there any way to recover the dataset and any/all of the data?

Very grateful someone can give me some good news :)


First observations, then suggestions...

I know people who have had variations of this scenario happen:
Person says they have everything they need but actually forgets about 
something, doesn't notice it was misplaced in a different folder, or 
doesn't check other folders.  Computer is restored/rebuilt/reimaged.  
Some data lost.
Admin makes backup of filesystem to another computer.  Rebuilds server.  
Attempts to restore backup but discovers backup was 
incomplete/corrupt/failed silently.  Some data lost.
Person moves data to a shiny new external hard drive.  A week later the 
external drive gets knocked over and dies.  One computer the data was 
copied from has already been wiped in preparation for reuse.  Some data 
lost.


One way to prevent the first case is to get the user to go through all 
their files and explicitly delete the files themselves.  Another is to 
go in after the user and look for content that was likely overlooked.  A 
third is to back it up anyway, even if they didn't say anything was 
needed.  I've heard of computer stores doing this: offering a for-fee 
data backup service when they repair a PC/replace a drive, if the 
customer declines the service, the data is still backed up (to protect 
the company from accidents on their end), and the job is completed.  If 
the customer returns within one week to say "I forgot about x, y, and 
z...can you recover it?" the company still has the backup that they can 
offer it to the customer for a fee higher than the original backup 
service fee (consider data recovery fee levels).


In the second case, verifying that you have a working backup procedure 
is key.  If you aren't making backups and just want to migrate data to a 
new system, bytewise or strong hash file comparisons can catch a lot of 
problems.  On Windows systems, the free Windiff tool works well.  It 
also runs fine under Ubuntu via Wine.


The third case demonstrates data vulnerability, single points of 
failure, and user unfamiliarity with external storage devices.  This 
user now makes backups of his/her most important data to high grade 
DVD+R media and keeps copies out of state.  (I've heard of people using 
the postal services as a defacto off site backup by mailing a DVD every 
day to a friend across the country.  When the friend receives it, they 
open the package, take out a preprinted postage label, apply it to the 
front, and drops it back in the mail to the original person.  Thus, the 
original person has backups in transit all the time and always has one 
on its way back to him.  If there were a catastrophic site failure 
such as an uncontrolled wildfire, the postal service would be able to 
hold the incoming mail for a few days until further plans are made.)



Suggestions:

You can try a recovery tool that supports file carving.  This technique 
looks for files based on their signatures while ignoring damaged, 
nonexistent, or unsupported partition and/or filesystem info.  Works 
best on small files, but gets worse as file sizes increase (or more 
accurately, gets worse as file fragmentation increases which tends to be 
more of a problem as file size increases).  Should work well for files 
smaller than the stripe size.  Small text based files have the best 
chances for recovery (e.g. TXT, RTF, HTML, XML, CSV) while more complex 
and large formats break more easily (e.g PPT, XLS, JPG).


Do not use the array, and disable or otherwise do not run scrub or any 
other command that may write to the array until you have exhausted your 
recovery options or no longer care to keep trying.


If you had a redundant ZFS system on multiple disks, you may need to run 
the file carver on each disk individually to find what is missing.  If 
this was a ZFS mirror, either disk will likely work, but if it was a 
RAIDZ1 or RAIDZ2 you will almost certainly need to scan all the disks 
that were in the array, because you won't know in advance which disk 
will have the original data and which will have the parity for any given 
file.


If you had compression and/or encryption enabled, standard file carving 
won't get you very far, as the raw disk sectors won't contain the 
patterns the file carver is looking for.  In that case you'd need a data 
recovery application that can handle ZFS compressi

Re: [zfs-discuss] Narrow escape!

2009-06-23 Thread Haudy Kazemi

"scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 
2009"

Zero errors even though other parts of the message definitely show errors?  
This is described here: http://docs.sun.com/app/docs/doc/819-5461/gbcve?a=view
Device errors do not guarantee pool errors when redundancy is present.

Change suggestion to the ZFS programmers:
insert the phrases 'unrecoverable pool data' or 'pool data' or word 'data' into 
the error message like this:
"scrub: resilver completed after 5h50m with 0 unrecoverable pool data errors on Tue 
Jun 23 05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 pool data errors on Tue Jun 23 
05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 data errors on Tue Jun 23 05:04:18 
2009"

That would clarify why it doesn't agree with the status message above it (which shows device errors) 
or with the numbers in the config table below it (which show detailed per-device errors: 3+11+23), 
and it would match the last line that says "No known data errors".


Ross wrote:

To be honest, never.  It's a cheap server sat at home, and I never got around 
to writing a script to scrub it and report errors.

I'm going to write one now though!  Look at how the resilver finished:

# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 2009
config:

NAMESTATE READ WRITE CKSUM
zfspool ONLINE   0 0 0
  raidz2ONLINE   0 0 0
c1t1d0  ONLINE   0 0 0  188G resilvered
c1t2d0  ONLINE   0 0 0
c1t3d0  ONLINE   3 0 0  128K resilvered
c1t4d0  ONLINE   0 011  473K resilvered
c1t5d0  ONLINE   0 023  986K resilvered

errors: No known data errors
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS, power failures, and UPSes

2009-06-29 Thread Haudy Kazemi

Hello,

I've looked around Google and the zfs-discuss archives but have not been 
able to find a good answer to this question (and the related questions 
that follow it):


How well does ZFS handle unexpected power failures? (e.g. environmental 
power failures, power supply dying, etc.)

Does it consistently gracefully recover?
Should having a UPS be considered a (strong) recommendation or a "don't 
even think about running without it" item?
Are there any communications/interfacing caveats to be aware of when 
choosing the UPS?


In this particular case, we're talking about a home file server running 
OpenSolaris 2009.06.  Actual environment power failures are generally < 
1 per year.  I know there are a few blog articles about this type of 
application, but I don't recall seeing any (or any detailed) discussion 
about power failures and UPSes as they relate to ZFS.  I did see that 
the ZFS Evil Tuning Guide says cache flushes are done every 5 seconds.


Here is one post that didn't get any replies about a year ago after 
someone had a power failure, then UPS battery failure while copying data 
to a ZFS pool:

http://lists.macosforge.org/pipermail/zfs-discuss/2008-July/000670.html

Both theoretical answers and real life experiences would be appreciated 
as the former tells me where ZFS is needed while the later tells me 
where it has been or is now.


Thanks,

-hk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, power failures, and UPSes (and ZFS recovery guide links)

2009-07-01 Thread Haudy Kazemi

Ian Collins wrote:

David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting 
seconds, or a minute.  Otherwise the outage is usually longer than 
the UPSs can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Or run your systems of DC and get as much backup as you have room (and 
budget!) for batteries.  I once visited a central exchange with 48 
hours of battery capacity...


The way Google handles UPSes is to have a small 12v battery integrated 
with each PC power supply.  When the machine is on, the battery has its 
charge maintained.  Not unlike a laptop in that it has a built-in 
battery backup, but using an inexpensive sealed lead acid battery 
instead of lithium ion.  Here is info along with photos of the Google 
server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html

(IIRC there have been power supply UPSes since at least the late 1980s 
which had an internal battery.  Either that or they were UPSes that fit 
inside the standard PC (AT) compatible desktop case, making the power 
protection system entirely internal to the computer.  I think I saw 
these models one time while browsing late 1980s or early 1990s issues of 
PC Magazine that reviewed UPSes.  They still exist...one company selling 
them is http://www.globtek.com/html/ups.html .  A Google search for 
'power supply built in UPS' would likely find more.)


I also did additional searches in the zfs-discuss archives and found a 
thread from mid-February, which lead me to other threads.  It looks like 
there are still scattered instances where ZFS has not recovered 
gracefully from power failures or other failures, where it became 
necessary to perform a manual transaction group (txg) rollback.  Here is 
a consolidated list of links related to manual uberblock transaction 
group (txg) rollback and similar ZFS data recovery guides, including 
undeleting:


Section 1: Nathan Hand's guide and related thread
Nathan Hand's guide to invalidating uberblocks (Dec 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=85794
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg22153.html


Section 2. Victor Latushkin's guide and related threads
Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l 
device problems (Oct 2008 to Feb 2009 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=76960
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19839.html

Repair report: Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
http://www.opensolaris.org/jive/message.jspa?messageID=289537#289537

Some recovery discussion by Victor: "zdb -bv alone took several hours to 
walk the block tree"

http://www.opensolaris.org/jive/message.jspa?messageID=292991#292991
or 
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022365.html

or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20095.html

Victor Latushkin's guide: "Thanks to COW nature of ZFS it was possible 
to successfully recover pool state which was only 5 seconds older than 
last unopenable one."

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022331.html
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20061.html


Section 3: reliability debates, recovery tool planning, uberblock info
Thread: Availability: ZFS needs to handle disk removal / driver failure 
better (August 2008 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=70811
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19057.html

Thread: ZFS: unreliable for professional usage? (Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=91426
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23833.html

Richard Elling's post that "uberblocks are kept in an 128-entry circular 
queue which is 4x redundant with 2 copies each at the beginning and end 
of the vdev. Other metadata, by default, is 2x redundant and spatially 
diverse."

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg24145.html

Jeff Bonwick's post about Bug ID 6667683
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23961.html

Bug ID 6667683: need a way to rollback to an uberblock from a previous txg
Description: If we are unable to open the pool based on the most recent 
uberblock then it might be useful to try an older txg uberblock as it 
might provide a better view of the world. Having a utility to reset the 
uberblock to a previous txg might provide a nice recovery mechanism.

http://bugs.opensolaris.org/bugd
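
(For what it is worth, later OpenSolaris builds grew a recovery option along the 
lines this RFE describes; a hedged sketch, assuming the 'zpool import -F' rewind 
flag from those later builds and a hypothetical pool name:)

# discard the last few transaction groups and import from an older,
# hopefully consistent uberblock (the discarded transactions are lost)
zpool import -F tank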

Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-07-01 Thread Haudy Kazemi

Erik Trimble wrote:

Neal Pollack wrote:

On 07/ 1/09 05:11 AM, Haudy Kazemi wrote:

Ian Collins wrote:
Or run your systems of DC and get as much backup as you have room 
(and budget!) for batteries.  I once visited a central exchange 
with 48 hours of battery capacity...


The way Google handles UPSes is to have a small 12v battery 
integrated with each PC power supply.  When the machine is on, the 
battery has its charged maintained.  Not unlike a laptop in that it 
has a built in battery backup, but using an inexpensive sealed lead 
acid battery instead of lithium ion.  Here is info along with photos 
of the Google server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html


which is of course why people claim that google is less green than 
detroit :-)


Each sealed lead-acid battery is good for about 2 years in those 
power supplies.

Google uses more than 10,000 servers, many more.
Do the math.  That's many many tons of lead and acid in the dump 
every 24 months.



Yes, but...

Lead acid batteries are one of (if not _the_) most-recycled items 
in the world. Something like 99.99% of all lead-acid batteries get 
fully recycled.
Lead acid batteries are one of the most recycled items, both because it 
makes economic sense and because it is a legal requirement.  According 
to Google's published results, they are also have some of the most power 
efficient systems out there with 90%+ efficient 12v power supplies and 
great Power Usage Efficiency (PUE) numbers: 
http://www.greenm3.com/2009/04/insights-into-googles-pue-a-laptop-approach-to-power-supplies-and-ups-for-servers-achieves-999-efficient-ups-system.html


I'm not convinced by the argument that Google is less green than 
Detroit, and from the smiley it appears this statement was meant as 
tongue-in-cheek humor.


Personally, I don't like Google's solution. That's way too many 
small batteries in everything.  I'd be more in favor of something like 
a double marine battery every 2 racks. Lots more power, and those 
things are far easier to recondition and reuse  - and much less labor 
intensive to install than 1 battery in 80+ servers.
With a good quality lead acid battery and appropriate charge management 
system, the battery can last the business life of the server without 
replacement (e.g. 4 years).  In that case the batteries could be 
considered 'hands off' and would be replaced as a single unit along with 
the server.  Google has talked about using commodity hardware vs. 
traditional server equipment, and here it looks like they have 
similar-to-commodity hardware optimized for efficiency via their 
leveraging of purchasing power (i.e. custom power supplies and OEM 
Gigabyte motherboards).


The experience people have with lead acid UPS batteries (and lithium 
phone and laptop batteries for that matter) dying in 2 years is 
primarily a function of poor quality batteries and/or poorly designed 
chargers that trickle charge the batteries to death.  (Margins on 
official replacement batteries for UPSes, laptops and phones are high, 
leaving room in the market for refilled batteries and third party 
equivalents.  There isn't much of an incentive to design in a good 
charging system.)  The electric vehicle community knows this well and 
makes sure to use good charging and balancing systems to get their 
batteries to last for hundreds to thousands of cycles over several years 
(UPS systems don't need to cycle very often, but they do need deep cycle 
discharge capability).  Some DIY electric vehicle enthusiasts 
successfully use batteries that in a former life served in UPSes but 
were revived.  More on lead acid charging and care:

Charging Basics: http://www.evdl.org/pages/hartcharge.html
Care Basics: http://www.evdl.org/pages/hartbatt.html

All this said, I certainly do agree that the proper thing to do is 
move to full 12V DC inputs for all computers intended for data center 
use. Eliminating the need for non-12V (i.e. get rid of all the stuff 
that want 5V) on the internal components is really needed to make this 
efficient; that way, all you need in the way of a power supply is 
something that takes 48VDC input, and breaks up the leads into 12V 
outputs. Really cheap, really efficient.  Having a nice 48VDC bus for 
the rack (like Telco) is much more energy efficient and far easier to 
hook something like a small UPS to...
I think it will be hard for 48v in 12v out DC/DC converters to compete 
in price and efficiency with a 240v AC input 12v DC out power supply 
that is 90%+ efficient (a quick Google search for 'power supply 95% 
efficient' finds models as well).  48v DC buses and batteries still need 
to be fed from a power supply of their own.  Google's approach seems 
reasonable, assuming they have integrated a good battery 
charger/maintainer and are running off 240v AC.


__

Re: [zfs-discuss] Open Solaris version recommendation? b114, b117?

2009-07-02 Thread Haudy Kazemi

Jorgen Lundman wrote:


We have been told we can have support for OpenSolaris finally, so we 
can move the ufs on zvol over to zfs with user-quotas.


Does anyone have any feel for the versions of Solaris that has zfs 
user quotas? We will put it on the x4540 for customers.


I have run b114 for about 5 weeks, and have yet to experience any 
problems. But b117 is what 2010/02 version will be based on, so 
perhaps that is a better choice. Other versions worth considering?


I know it's a bit vague, but perhaps there is a known panic in a 
certain version that I may not be aware.


Lund



OpenSolaris 2008.05 is based on b86a from mid-March 2008.  Released May 
5, 2008.  Build locked in about 7 weeks before final release.
OpenSolaris 2008.11 is based on b101 from mid-October 2008.  Released 
Dec 1, 2008.  Build locked in about 6 weeks before final release.
OpenSolaris 2009.06 is based on b111 from mid-March 2009.  Released June 
1, 2009.  Build locked in about 10 weeks before final release.  Previews 
based on earlier builds were also made.


Projection, assuming the patterns above hold and the target release date 
is unchanged (I seem to recall a 6 month release schedule akin to 
Ubuntu's to have been initially planned) :
OpenSolaris 2010.02 final will be based on b128 to b130 from 
November/December 2009.


Release schedule: http://www.opensolaris.org/os/community/on/schedule/
Dates: http://genunix.org/ and http://www.phoronix.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single disk parity

2009-07-08 Thread Haudy Kazemi

Daniel Carosone wrote:

Sorry, don't have a thread reference
to hand just now.



http://www.opensolaris.org/jive/thread.jspa?threadID=100296

Note that there's little empirical evidence that this is directly applicable to 
the kinds of errors (single bit, or otherwise) that a single failing disk 
medium would produce.  Modern disks already include and rely on a lot of ECC as 
part of ordinary operation, below the level usually seen by the host.  These 
mechanisms seem unlikely to return a read with just one (or a few) bit errors.

This strikes me, if implemented, as potentially more applicable to errors 
introduced from other sources (controller/bus transfer errors, non-ecc memory, 
weak power supply, etc).  Still handy.
  


Adding additional data protection options is commendable.  On the other 
hand I feel there are important gaps in the existing feature set that 
are worthy of a higher priority, not the least of which is the automatic 
recovery of uberblock / transaction group problems (see Victor 
Latushkin's recovery technique which I linked to in a recent post), 
followed closely by a zpool shrink or zpool remove command that lets you 
resize pools and disconnect devices without replacing them.  I saw 
postings or blog entries from about 6 months ago that this code was 
'near' as part of solving a resilvering bug but have not seen anything 
else since.  I think many users would like to see improved resilience in 
the existing features and the addition of frequently long requested 
features before other new features are added.  (Exceptions can readily 
be made for new features that are trivially easy to implement and/or are 
not competing for developer time with higher priority features.)


In the meantime, there is the copies flag option that you can use on 
single disks.  With immense drives, even losing 1/2 the capacity to 
copies isn't as traumatic for many people as it was in days gone by.  
(E.g. consider a 500 gb hard drive with copies=2 versus a 128 gb SSD).  
Of course if you need all that space then it is a no-go.
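
(For instance, a minimal sketch; the dataset names are hypothetical, and note 
that the copies property only applies to data written after it is set:)

zfs create -o copies=2 tank/notes      # every block of new data is stored twice
zfs set copies=2 tank/existing         # only affects blocks written from now on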


Related threads that also had ideas on using spare CPU cycles for brute 
force recovery of single bit errors using the checksum:


[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data 
corruption?

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg14720.html

[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18540.html


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single disk parity

2009-07-08 Thread Haudy Kazemi




Adding additional data protection options is commendable.  On the 
other hand I feel there are important gaps in the existing feature 
set that are worthy of a higher priority, not the least of which is 
the automatic recovery of uberblock / transaction group problems (see 
Victor Latushkin's recovery technique which I linked to in a recent 
post), 


This does not seem to be a widespread problem.  We do see the
occasional complaint on this forum, but considering the substantial
number of ZFS implementations in existence today, the rate seems
to be quite low.  In other words, the impact does not seem to be high.
Perhaps someone at Sun could comment on the call rate for such
conditions?
I counter this.  The user impact is very high when the pool is 
completely inaccessible due to a minor glitch in the ZFS metadata, and 
the user is told to restore from backups, particularly if they've been 
considering snapshots to be their backups (I know they're not the same 
thing).  The incidence rate may be low, but the impact is still high, 
and anecdotally there have been enough reports on list to know it is a 
real non-zero event probability.  Think earth-asteroid 
collisions...doesn't happen very often but is catastrophic when it does 
happen.  Graceful handling of low incidence high impact events plays a 
role in real world robustness and is important in widescale adoption of 
a filesystem.  It is about software robustness in the face of failure 
vs. brittleness.  (In another area, I and others found MythTV's 
dependence on MySQL to be a source of system brittleness.)  Google adopts 
robustness principles in its Google File System (GFS) by not trusting 
the hardware at all and then keeping a minimum of three copies of 
everything on three separate computers.


Consider the user's/admin's dilemma of choosing between a filesystem that 
offers all the great features of ZFS but can be broken (and is 
documented to have broken) with a few miswritten bytes, or choosing a 
filesystem with no great features but is also generally robust to wide 
variety of minor metadata corrupt issues.  Complex filesystems need to 
take special measures that their complexity doesn't compromise their 
efforts at ensuring reliability.  ZFS's extra metadata copies provide 
this versus simply duplicating the file allocation table as is done in 
FAT16/32 filesystems (a basic filesystem).  The extra filesystem 
complexity also makes users more dependent upon built in recovery 
mechanisms and makes manual recovery more challenging. (This is an 
unavoidable result of more complicated filesystem design.)


More below.
followed closely by a zpool shrink or zpool remove command that lets 
you resize pools and disconnect devices without replacing them.  I 
saw postings or blog entries from about 6 months ago that this code 
was 'near' as part of solving a resilvering bug but have not seen 
anything else since.  I think many users would like to see improved 
resilience in the existing features and the addition of frequently 
long requested features before other new features are added.  
(Exceptions can readily be made for new features that are trivially 
easy to implement and/or are not competing for developer time with 
higher priority features.)


In the meantime, there is the copies flag option that you can use on 
single disks.  With immense drives, even losing 1/2 the capacity to 
copies isn't as traumatic for many people as it was in days gone by.  
(E.g. consider a 500 gb hard drive with copies=2 versus a 128 gb 
SSD).  Of course if you need all that space then it is a no-go.


Space, performance, dependability: you can pick any two.



Related threads that also had ideas on using spare CPU cycles for 
brute force recovery of single bit errors using the checksum:


There is no evidence that the type of unrecoverable read errors we
see are single bit errors.  And while it is possible for an error handling
code to correct single bit flips, multiple bit flips would remain as a
large problem space.  There are error codes which can correct multiple
flips, but they quickly become expensive.  This is one reason why nobody
does RAID-2.
Expensive in CPU cycles or engineering resources or hardware or 
dollars?  If the argument is CPU cycles, then that is the same case made 
against software RAID as a whole and an argument increasingly broken by 
modern high performance CPUs.  If the argument is engineering resources, 
consider the complexity of ZFS itself.  If the argument is hardware, 
(e.g. you need a lot of disks) why not run it at the block level?  
Dollars is going to be a function of engineering resources, hardware, 
and licenses.


There are many error correcting codes available.  RAID2 used Hamming 
codes, but that's just one of many options out there.  Par2 uses 
configurable strength Reed-Solomon to get multi bit error correction.  
The par2 source is available, although from a ZFS perspective is 
hindered by the CDDL-GPL license incompatibility problem.

Re: [zfs-discuss] Single disk parity

2009-07-10 Thread Haudy Kazemi

Richard Elling wrote:
There are many error correcting codes available.  RAID2 used Hamming 
codes, but that's just one of many options out there.  Par2 uses 
configurable strength Reed-Solomon to get multi bit error 
correction.  The par2 source is available, although from a ZFS 
perspective is hindered by the CDDL-GPL license incompatibility problem.


It is possible to write a FUSE filesystem using Reed-Solomon (like 
par2) as the underlying protection.  A quick search of the FUSE 
website turns up the Reed-Solomon FS (a FUSE-based filesystem):
"Shielding your files with Reed-Solomon codes" 
http://ttsiodras.googlepages.com/rsbep.html


While most FUSE work is on Linux, and there is a ZFS-FUSE project for 
it, there has also been FUSE work done for OpenSolaris:

http://www.opensolaris.org/os/project/fuse/
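
(As a concrete file-level illustration of configurable-strength Reed-Solomon, 
a sketch assuming the usual par2cmdline tool and a hypothetical file name:)

par2 create -r15 archive.par2 archive.tar   # build roughly 15% recovery data
par2 verify archive.par2                    # check the file against the recovery set
par2 repair archive.par2                    # rebuild damaged blocks if enough recovery data survives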


BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption.  This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script or only after one has 
otherwise experienced corruption problems?



It is intended to try to answer the question of whether the errors we see
in real life might be single bit errors.  I do not believe they will be
single bit errors, but we don't have the data.

To be pedantic, wouldn't protected data also be affected if all 
copies are damaged at the same time, especially if also damaged in 
the same place?


Yep.  Which is why there is RFE CR 6674679, complain if all data
copies are identical and corrupt.
-- richard


There is a related but an unlikely scenario, that is also probably not 
covered yet.  I'm not sure what kind of common cause would lead to it.  
Maybe a disk array turning into swiss cheese with bad sectors suddenly 
showing up on multiple drives?  Its probability increases with larger 
logical block sizes (e.g. 128k blocks are at higher risk than 4k blocks; 
a block being the smallest piece of storage real estate used by the 
filesystem).  It is the edge case of multiple damaged copies where the 
damage is unreadable bad sectors on different corresponding sectors of a 
block.  This could be recovered from by copying the readable sectors 
from each copy and filling in the holes using the appropriate sectors 
from the other copies.  The final result, a rebuilt block, should pass 
the checksum tests assuming there were not any other problems with the 
still readable sectors.
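
(A rough sketch of that splice idea, using hypothetical device paths and starting 
sectors, with a 128K block treated as 256 x 512-byte sectors; locating the block's 
offset on each device is assumed to have been done separately, e.g. with zdb:)

# pull the block from each damaged copy, zero-filling unreadable sectors
dd if=/dev/rdsk/c1t1d0s0 of=copyA.bin bs=512 skip=1234567 count=256 conv=noerror,sync
dd if=/dev/rdsk/c1t2d0s0 of=copyB.bin bs=512 skip=7654321 count=256 conv=noerror,sync
# splice: suppose sector 37 was unreadable in copy A but readable in copy B
dd if=copyB.bin of=copyA.bin bs=512 skip=37 seek=37 count=1 conv=notrunc
# the rebuilt copyA.bin should now pass the block checksum if the readable sectors were intact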


---

A bad sector specific recovery technique is to instruct the disk to 
return raw read data rather than trying to correct it.  The READ LONG 
command can do this (though the specs say it only works on 28 bit LBA).  
(READ LONG corresponds to writes done with WRITE LONG (28 bit) or WRITE 
UNCORRECTABLE EXT (48 bit).  Linux HDPARM uses these write commands when 
it is used to create bad sectors with the --make-bad-sector command.  
The resulting sectors are low level logically bad where the sector's 
data and ECC do not match; they are not physically bad).  With multiple 
read attempts, a statistical distribution of the likely 'true' contents 
of the sector can be found.  Spinrite claims to do this.  Linux 'HDPARM 
--read-sector' can sometimes return data from nominally bad sectors too 
but it doesn't have a built in statistical recovery method (a wrapper 
script could probably solve that).  I don't know if HDPARM --read sector 
uses READ LONG or not.
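
(A minimal sketch of the repeated-read idea on Linux, with a hypothetical device 
and sector number; --read-sector issues a raw ATA read that bypasses the kernel 
block layer and caches:)

for i in 1 2 3 4 5; do hdparm --read-sector 123456789 /dev/sdb > read_$i.txt; done
md5sum read_*.txt   # identical hashes suggest a stable read; differences hint at marginal media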

HDPARM man page: http://linuxreviews.org/man/hdparm/

Good description of IDE commands including READ LONG and WRITE LONG 
(specs say they are 28 bit only)

http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html
SCSI versions of READ LONG and WRITE LONG
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long

Here is a report by forum member "qubit" modifying his Linux taskfile 
driver to use READ LONG for data recovery purposes, and his subsequent 
analysis:


http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5

-- quote --
318. Posted at 07:00 am on Jun 6th 2002 by qubit

My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write 
off the data as being unrecoverable. I used WinHex to make a ghost image 
of the drive onto my new larger one, zeroing out the bad sectors in the 
target while logging each bad sector. (There were bad sectors in the FAT 
so I combined the good parts from FATs 1 and 2.) At this point I had a 
working mirror of the drive that went bad, with zeroed-out 512 byte 
holes in files where the bad sectors were.


Then I set the 75GXP aside, because I knew it was possible to recover 
some of the data *on* the bad sectors, but I didn't have the tools to do 
it. So I decided to wait until then to RMA it.


I did 

Re: [zfs-discuss] Motherboard for home zfs/solaris file server

2009-07-23 Thread Haudy Kazemi

chris wrote:

Ok, so the choice for a MB boils down to:

- Intel desktop MB, no ECC support
  
This is mostly true.  The exceptions are some implementations of the 
Socket T LGA 775 (i.e. late Pentium 4 series, and Core 2) D975X and X38 
chipsets, and possibly some X48 boards as well.  Intel's other desktop 
chipsets do not support ECC.  Some motherboard examples include:


Intel DX38BT - ECC support is mentioned in the documentation and is a 
BIOS option
Gigabyte GA-X38-DS4, GA-EX38-DS4 - ECC support is mentioned in the 
documentation and is listed in the website FAQ

The Sun Ultra 24 also uses the X38 chipset.

It's not clear how well ECC support has actually been implemented on the 
Intel and Gigabyte boards, i.e. whether it is simply unbuffered ECC 
memory compatible, or actually able to initialize and use the ECC 
capability.  I mentioned the X48 chipset above because discussions 
surrounding it say it is just a higher binned X38 chip.


On Linux, the EDAC project maintains software to manage the 
motherboard's ECC capability.  A list of memory controllers currently 
supported by Linux EDAC is here:

http://buttersideup.com/edacwiki/Main_Page

A prior discussion thread in 'fm' titled 'X38/975x ECC memory support' 
is here:

http://opensolaris.org/jive/thread.jspa?threadID=52440&tstart=60

Thread links:
http://www.madore.org/~david/linux/#ECC_for_82x
http://developmentonsolaris.wordpress.com/2008/03/12/intel-82975x-mch-and-logging-of-ecc-events-on-solaris/

Note that the 'ecccheck.pl' script depends on the 'pcitweak' utility 
which is no longer present in OpenSolaris 2009.06 and Ubuntu 8.10 
because of Xorg changes.  One Linux user needing the utility copied it 
from another distro.  The version of pcitweak included with previous 
versions of OpenSolaris might work on 2009.06.

http://opensolaris.org/jive/thread.jspa?threadID=105975&tstart=90
http://ubuntuforums.org/showthread.php?t=1054516

Finally, on unbuffered ECC memory prices and speeds...they are a bit 
behind in price and speed vs. regular unbuffered RAM, but both are still 
reasonable.  When comparing prices, keep in mind that ECC RAM uses 9 
chips where non-ECC uses 8, so expect at least a 12.5% price increase.  
Consider:


DDR2: $64 for Crucial 4GB kit (2GBx2), 240-pin DIMM, Unbuffered DDR2 
PC2-6400 memory module

http://www.crucial.com/store/partspecs.aspx?IMODULE=CT2KIT25672AA800

DDR3: $108 for Crucial 6GB (3 x 2GB) 240-Pin DDR3 SDRAM ECC Unbuffered 
DDR3 1333 (PC3 10600) Triple Channel Kit Server Memory Model 
CT3KIT25672BA1339 - Retail

http://www.newegg.com/Product/Product.aspx?Item=N82E16820148259

-hk


- Intel server MB, ECC support, expensive (requires a Xeon for speedstep 
support). It is a shame to waste top kit doing nothing 24/7.
- AMD K8: ECC support(right?), no Cool'n'quiet support (but maybe still cool 
enough with the right CPU?)
- AMD K10: should have the best all of both worlds: ECC support, Cool'n'quiet, 
cheap-ish and lowish-power CPU like Athlon II 250

Is my understanding correct? Like many I want reliable, cheap, low power, ECC 
supporting MB. Integrated video and low power chipset would be best. The sata 
ports will have to come from an additional controller it seems, but that's life.

Intel gear is best supported, but they shoot themselves (or is that us?) 
in the foot with their ECC-on-server MB policy.

AMD K10 seems the most tempting as it has it all. I wonder about solaris support though. For example, is an AM3 MB OK with solaris? 


I'd like this hopefully to work right away with opensolaris 2009.06, without 
fiddling with drivers, I dont have much time or skills.

What AM3 MB do you guys know that is trouble free with solaris? 


If none, maybe top quality ram (suggestions?) would allow me to forego ECC and 
use a well supported low power intel board (suggestions?) instead? and a E5200?

Thanks for your insight.
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] utf8only and normalization properties

2009-08-12 Thread Haudy Kazemi

Hello,

I'm wondering what are some use cases for ZFS's utf8only and 
normalization properties.  They are off/none by default, and can only be 
set when the filesystem is created.  When should they specifically be 
enabled and/or disabled?  (i.e. Where is using them a really good idea?  
Where is using them a really bad idea?)


Looking forward, starting with Windows XP and OS X 10.5 clients, is 
there any reason to change the defaults in order to minimize problems?


From the documentation at 
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html :


utf8only
Boolean
Off
This property indicates whether a file system should reject file names 
that include characters that are not present in the UTF-8 character code 
set. If this property is explicitly set to off, the normalization 
property must either not be explicitly set or be set to none. The 
default value for the utf8only property is off. This property cannot be 
changed after the file system is created.


normalization
String
None
This property indicates whether a file system should perform a unicode 
normalization of file names whenever two file names are compared, and 
which normalization algorithm should be used. File names are always 
stored unmodified, names are normalized as part of any comparison 
process. If this property is set to a legal value other than none, and 
the utf8only property was left unspecified, the utf8only property is 
automatically set to on. The default value of the normalization property 
is none. This property cannot be changed after the file system is created.


Background: I've built a test system running OpenSolaris 2009.06 (b111) 
with a ZFS RAIDZ1, with CIFS in workgroup mode.  I'm testing with 
Windows XP and Mac OS X 10.5 clients connecting via CIFS (no NFS or AFP).

I've set these properties during zfs create or immediately afterwards:
casesensitivity=mixed
compression=on
snapdir=visible

and ran this to set up nonrestrictive ACLs as suggested by Alan Wright 
at the thread "[cifs-discuss] CIFS and permission mapping" at 
http://opensolaris.org/jive/message.jspa?messageID=365620#365947

chmod A=everyone@:full_set:fd:allow /tank/home
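
(If the utf8only and normalization properties were also wanted, they would have 
to be supplied at creation time alongside the others; a sketch, where the exact 
normalization value spelling used here (formD) is an assumption to check against 
the build's zfs man page:)

zfs create -o casesensitivity=mixed -o compression=on -o snapdir=visible \
    -o utf8only=on -o normalization=formD tank/home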

Thanks!

-hk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] unsetting/resetting ZFS properties

2009-08-12 Thread Haudy Kazemi

Hello,

I recently asked myself this question: Is it possible to unset ZFS 
properties?  Or reset one to its default state without looking up what 
that default state is?
I believe the answer is yes, via the zfs inherit command (I haven't 
verified yet, but I think a case could be made to add functionality to 
the zfs set command...or the documentation...to make this clearer.)


An example:
You have a pool named tank.
You have created a filesystem called 'home' and it has a child 
filesystem called 'smith'.

You run: zfs set compression=on tank/home
which turns on compression on the 'home' filesystem (it is a local 
property) and on the 'smith' filesystem (as an inherited property).  
(You inspect the properties with 'zfs get'.)


You then run: zfs set compression=on tank/home/smith
which makes compression on the 'smith' filesystem also be a local property.

At this point you decide you would rather that the compression property 
for filesystem 'smith' be inherited after all, not be a local property 
anymore.


You run:
zfs set compression=off tank/home/smith
but that doesn't unset the compression setting for filesystem 'smith', 
it just overrides the inheritance of compression=on (as expected).


So how to unset/reset?

In looking for an answer I went back to the page where I found the 
available properties and their valid parameters:

Introducing ZFS Properties
http://docs.sun.com/app/docs/doc/817-2271/gazss?l=en&a=view

I didn't see anything under 'zfs set' or under the 'compression' section 
for how to unset a property.  I did find a link to this page:

Setting ZFS Properties
http://docs.sun.com/app/docs/doc/817-2271/gazsp?l=en&a=view

which had a link to this page:
man pages section 1M: System Administration Commands
http://docs.sun.com/app/docs/doc/819-2240/zfs-1m?l=en&a=view

which talked about 'zfs inherit' and 'zfs set':
zfs inherit [-r] property filesystem|volume|snapshot ...
zfs set property=value filesystem|volume|snapshot ...
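
(So the reset operation does exist today, just under 'zfs inherit' rather than 
'zfs set'; a quick sketch against the filesystems above:)

zfs inherit compression tank/home/smith
zfs get compression tank/home/smith
# NAME             PROPERTY     VALUE  SOURCE
# tank/home/smith  compression  on     inherited from tank/home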

*
In short, I think an alias for 'zfs inherit' could be added to 'zfs set' 
to make it more clear to those of us still new to ZFS.  Either that, or 
add some additional pointers in the Properties documentation that the 
set command can't unset/reset properties.

The alias could work like this:
If someone issues a command like this:
zfs set property=inherit filesystem|volume|snapshot
then run this code path:
zfs inherit property filesystem|volume|snapshot

The -r command could be added to 'zfs set' as well, to allow 'zfs set' 
to recursively set local properties on child filesystems.

zfs set -r property=inherit filesystem|volume|snapshot
then run this code path:
zfs inherit -r property filesystem|volume|snapshot

Another example if zfs set was extended:
zfs set -r compression=on tank/home
would set a local property of compression=on for 'home' and each of its 
child filesystems.  (new functionality)


zfs set -r compression=inherit tank/home
would set the property of compression to default for 'home' and each of 
its child filesystems.  (alias of zfs inherit -r compression tank/home)


zfs set compression=inherit tank/home
would set the property of compression to default for 'home' and leave 
the child filesystems properties untouched (alias of zfs inherit 
compression tank/home)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] unsetting/resetting ZFS properties

2009-08-13 Thread Haudy Kazemi




In short, I think an alias for 'zfs inherit' could be added to 'zfs 
set' to make it more clear to those of us still new to ZFS.  Either 
that, or add some additional pointers in the Properties 
documentation that the set command can't unset/reset properties.


That would to me be confusing; it would also complicate the code quite 
a lot because now the action would be part of the value for a 
different subcommand.   It also won't work at all for some 
properties, in particular what if you have a hostname called 
"inherit" that is going to be very confusing for the share* properties.
Maybe I'm missing something, but what would a specific example be?  I 
don't see one in the docs that would create a conflict.


I see the following valid options for sharesmb at 
http://docs.sun.com/app/docs/doc/817-2271/gfwqv?l=en&a=view and at 
http://docs.sun.com/app/docs/doc/817-2271/gfwpk?l=en&a=view:

# zfs set sharesmb=off sandbox/fs2
# zfs set sharesmb=on sandbox
# zfs set sharesmb=name=myshare sandbox/fs2
The documentation says this in regards to 'name': "A pseudo property 
name is also supported that allows you to replace the dataset name 
with a specific name. The specific name is then used to replace the 
prefix dataset in the case of inheritance."


I see the following valid options for sharenfs at 
http://docs.sun.com/app/docs/doc/817-2271/gamnd?l=en&a=view :

# zfs set sharenfs=on tank/home
# zfs set sharenfs=ro tank/home/tabriz
# zfs set sharenfs=off tank/home

The documentation says this: The sharenfs property is a comma-separated 
list of options to pass to the share command. The special value on is an 
alias for the default share options, which are read/write permissions 
for anyone. The special value off indicates that the file system is not 
managed by ZFS and can be shared through traditional means, such as the 
/etc/dfs/dfstab file. All file systems whose sharenfs property is not 
off are shared during boot.


'inherited' would be one more special value.

If there is an issue here I believe we should first trying and 
resolve it with documentation changes.
Some UI guides say there is room for improvement in the design of a 
system if it isn't self-evident/self-documenting to reasonably informed 
people.  As this is happening to tech savvy people (assuming that those 
who are using/trying out OpenSolaris and ZFS are relatively tech savvy), 
this is particularly evident. 



I'd have to say that probably most customers I've worked with on zfs 
have fallen over this one, given up trying to work out how to do it, 
and had to ask (and the first time it happened, I had to ask too). The 
obvious things they have tried are generally something along the lines of


zfs set foobar=inherit[ed] ...

There is something unnatural about the 'zfs inherit' command -- it 
just isn't where that functionality is expected to be, based on the 
structure of the other commands.

This is exactly what happened to me.

I had tried:
zfs set compression=off tank/home/smith  (had the logical result of 
setting a local property)

zfs set compression=default tank/home/smith
zfs set compression=inherit tank/home/smith
zfs set compression=inherited tank/home/smith

None of which did what I wanted (to set the value to 
default/inherited).  The commands 'zfs get' and 'zfs set' felt natural 
to view/set/edit filesystem properties.  Editing the property to go back 
to the default/inherited setting really feels like something that 
belongs under 'zfs set', rather than as a top level command.  The 
documentation examples show 'zfs set' as being 'property=value', and 
'value' can take on various text or numerical settings depending on 
which property is being changed.  The intuitive setting is to have a 
'value' that unsets/resets.  The general thought process: "I used 'zfs 
set' to change the value of the property, now I want to change it back 
to what it was, so why should I expect to need to use a different top 
level command? "


I'd like the properties documentation to show all the valid range of 
values for a given property, if practical.  If a 'zfs inherit'/'zfs set' 
alias was created, one of these values would be 'inherit' or 'inherited' 
or 'default'.  On a related note, the documentation (same URL below) for 
the "normalization" property is missing the list of valid values besides 
"none".


In regards to potential edge/corner(?) case conflicts with the share* 
properties, the documentation at 
http://docs.sun.com/app/docs/doc/817-2271/gazss?l=en&a=view says:


Under sharenfs:
If set to on, the zfs share command is invoked with no options. 
Otherwise, the zfs share command is invoked with options equivalent to 
the contents of this property.


Under sharesmb:
If the property is set to on, the sharemgr command is invoked with no 
options. Otherwise, the sharemgr command is invoked with options that 
are equivalent to the contents of this property.


This tells me these two properties are already in effect aliases of 
other commands.
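
(For completeness, the existing way to see whether a share* property is local or 
inherited, sketched with a hypothetical dataset:)

zfs get -o name,property,value,source sharenfs,sharesmb tank/home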

Re: [zfs-discuss] utf8only and normalization properties

2009-08-13 Thread Haudy Kazemi

Nicolas Williams wrote:

On Wed, Aug 12, 2009 at 06:17:44PM -0500, Haudy Kazemi wrote:
  
I'm wondering what are some use cases for ZFS's utf8only and 
normalization properties.  They are off/none by default, and can only be 
set when the filesystem is created.  When should they specifically be 
enabled and/or disabled?  (i.e. Where is using them a really good idea?  
Where is using them a really bad idea?)



These are for interoperability.

The world is converging on Unicode for filesystem object naming.  If you
want to exclude non-Unicode strings then you should set utf8only (some
non-Unicode strings in some codesets can look like valid UTF-8 though).

But Unicode has multiple canonical and non-canonical ways of
representing certain characters (e.g., ´).  Solaris and Windows
input methods tend to conform to NFKC, so they will interop even if you
don't enable the normalization feature.  But MacOS X normalizes to NFD.

Therefore, if you need to interoperate with MacOS X then you should
enable the normalization feature.
  
Thank you for the reply. My goal is to configure the filesystem for the 
lowest common denominator without knowing up front which clients will be 
used. OS X and Win XP are listed because they are commonly used as 
desktop OSes.  Ubuntu Linux is a third potential desktop OS.


The normalization property documentation says "this property indicates 
whether a file system should perform a unicode normalization of file 
names whenever two file names are compared.  File names are always 
stored unmodified, names are normalized as part of any comparison 
process."  Where does the file system use filename comparisons and what 
does it use them for?  Filename collision checking?  Sorting?


Is it used for any other operation, say when returning a filename to an 
application?  Would applications reading/writing files to a ZFS 
filesystem ever notice the difference in normalization settings as long 
as they produce filenames that do not conflict with existing names or 
create invalid UTF8?  The documentation says filenames are stored 
unmodified, which sounds like things should be transparent to applications.


(In regard to filename collision checking, if non-normalized unmodified 
filenames are always stored on disk, and they don't conflict in 
non-normalized form, what would the point be of normalizing the 
filenames for a comparison?  To verify there isn't conflict in 
normalized forms, and if there is no conflict with an existing file to 
allow the filename to be written unmodified?)
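
(A concrete illustration of where the comparison matters, assuming a dataset 
created with normalization enabled and a UTF-8 locale; the two mkdir calls below 
spell the same visible name 'café' in precomposed (U+00E9) and decomposed 
(U+0065 U+0301) form:)

mkdir "$(printf 'caf\303\251')"     # precomposed e-acute (UTF-8 c3 a9)
mkdir "$(printf 'cafe\314\201')"    # decomposed e + combining acute (UTF-8 cc 81):
                                    # fails with "File exists" when normalization is on,
                                    # creates a second, visually identical directory
                                    # when normalization=none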



Looking forward, starting with Windows XP and OS X 10.5 clients, is 
there any reason to change the defaults in order to minimize problems?



You should definetely enable normalization (see above).

It doesn't matter what normalization form you use, but "nfd" runs faster
than "nfc".

The normalization feature doesn't cost much if you use all US-ASCII file
names.  And it doesn't cost much if your file names are mostly US-ASCII.

Nico
  
The ZFS documentation doesn't list the valid values for the 
normalization property other than 'none'.  From your reply and from the 
the official unicode docs at
http://unicode.org/reports/tr15/ and 
http://unicode.org/faq/normalization.html
would it be correct to conclude that none, NFD, NFC, NFKC, and NFKD are 
the only valid values for the ZFS normalization property?  If so, I 
suggest they be added to the documentation at

http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Thanks,

-hk





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal

2007-07-09 Thread Haudy Kazemi
On Jul 9 2007, Domingos Soares wrote:

>Hi,
>
>> It might be interesting to focus on compression algorithms which are
>> optimized for particular workloads and data types, an Oracle database for
>> example.
>
>Yes, I agree. That is what I meant when I said "The study might be
>extended to the analysis of data in specific applications (e.g. web
>servers, mail servers and others) in order to develop compression
>schemes for specific environments...". However, I was not considering
>it as a major task, but a minor one. How important such a feature
>would be to opensolaris?

Some specific cases where you could find extra compression would be:
-differencing multiple versions of documents (doc/xls/html) (this type of 
delta compression is currently possible using SmartVersion from 
http://www.smartversion.com/ I haven't seen delta compression in other 
non-backup related compression tools; as I understand it, ZFS snapshots are 
filesystem-wide deltas)

-media types known to be further recompressible: some older AVI and 
QuickTime video actually compress quick well using ZIP or RAR. The RAR 
format itself has a multimedia compression option to enable algorithms that 
work better on multimedia content.

>> It might be worthwhile to have some sort of adaptive compression whereby
>> ZFS could choose a compression algorithm based on its detection of the
>> type of data being stored.
>
>  That's definitely a great idea. I'm just afraid that would be a bit
>hard to identify the data type of a given block or set of blocks in
>order to adapt the compression algorithm to it. At the file level it
>would be pretty easy in most cases, but at the block level we don't
>have a clue about what kind of data are inside the block. The
>identification process would depend on some statistical properties of
>the data and I don't know how hard it would be to scan the blocks and
>process them on a reasonable amount of time, and the whole thing must
>be done before the compression really starts.

Wouldn't ZFS's being an integrated filesystem make it easier for it to 
identify the file types vs. a standard block device with a filesystem 
overlaid upon it?

I read in another post that with compression enabled, ZFS attempts to 
compress the data and stores it compressed if it compresses enough. As far 
as identifying the file type/data type how about:
1.) The ZFS block compression system reads the ZFS file table to identify which 
blocks are the beginning of files (or, for new writes, the block compression 
system is notified that file.ext is being written starting at a given block, 
e.g. block 9,000,201).
2.) The ZFS block compression system reads that starting block, identifies the 
file type (probably based on the file header), and applies the most appropriate 
compression format, or the default if none is identified.

An approach for maximal compression:
The algorithm selection could be
1.) attempt to compress using BWT, store compressed if BWT works better 
than no compression
2.) when CPU is otherwise idle, use 10% of spare cpu cycles to "walk the 
disk", trying to recompress each block with each of the various supported 
compression algorithms, ultimately storing that block in the most space 
efficient compression format.

This technique would result in a file system that tends to compact its data 
ever more tightly as the data sits in it. It could be compared to 
'settling' flakes in a cereal box...the contents may have had a lot of 'air 
space' before shipment, but are now 'compressed'. The recompression step 
might even be part of a periodic disk scrubbing step meant to check and 
recheck previously written data to make sure the sector it is sitting on 
isn't going bad.
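
(A rough way to compare candidate algorithms on a sample of real data today, 
sketched with hypothetical dataset and path names:)

zfs create -o compression=lzjb tank/ctest
cp -r /data/sample /tank/ctest/
zfs get compressratio tank/ctest   # repeat with other compression settings and compare ratios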

An aging (2002) but thorough comparison of many archivers/algorithms is 
Jeff Gilchrist's Archive Comparison Test: http://compression.ca/ 
http://compression.ca/act/act-win.html

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General recommendations on raidz groups of different sizes

2007-07-21 Thread Haudy Kazemi
How would one calculate system reliability estimates here? One is a RAIDZ 
set of 6 disks, the other a set of 8. The reliability of each RAIDZ set by 
itself isn't too hard to calculate, but I don't know how to combine them, 
especially since they're different sizes.
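
(A back-of-envelope sketch, using the standard single-parity MTTDL approximation 
and assuming independent failures; the MTBF and MTTR figures below are made up 
purely for illustration:)

MTTDL(raidz1 of N disks) ~= MTBF^2 / (N * (N-1) * MTTR)
1/MTTDL(pool)            ~= 1/MTTDL(6-disk group) + 1/MTTDL(8-disk group)

With MTBF = 500,000 hours and MTTR = 24 hours:
  6-disk group: 500,000^2 / (6*5*24)   ~= 3.5e8 hours
  8-disk group: 500,000^2 / (8*7*24)   ~= 1.9e8 hours
  pool:         1/(1/3.5e8 + 1/1.9e8)  ~= 1.2e8 hours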

On Jul 19 2007, Richard Elling wrote:

>After a cup of French coffee, I feel strong enough to recommend :-)
>
>David Smith wrote:
>> What are your thoughts or recommendations on having a zpool made up of 
>> raidz groups of different sizes? Are there going to be performance 
>> issues?
>
> It is more complicated and, in general, more complicated is a bad thing. 
> But in your example, with only 2 top-level vdevs, it isn't overly 
> complicated.
>
> Performance issues will be difficult to predict because this hasn't been 
> studied. With the gazillions of possible permutations, it is not likely 
> to be extensively characterized. But it if it works for you, then be 
> happy :-)
>  -- richard
>
>> For example:
>> 
>>   pool: testpool1
>>  state: ONLINE
>>  scrub: none requested
>> config:
>> 
>> NAME STATE READ WRITE CKSUM
>> testpool1 ONLINE 0 0 0
>> 
>>   raidz1 ONLINE 0 0 0
>> c12t600A0B800029E5EA07234685122Ad0 ONLINE 0 0 0
>> c12t600A0B800029E5EA07254685123Cd0 ONLINE 0 0 0
>> c12t600A0B800029E5EA072F46851256d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA073146851266d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA073746851278d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA074146851292d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA0747468512B6d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA0749468512C2d0 ONLINE 0 0 0
>>   raidz1 ONLINE 0 0 0
>> c12t600A0B800029E5EA074F468512E0d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA0751468512E8d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA07574685130Cd0 ONLINE 0 0 0
>> c12t600A0B800029E5EA075946851318d0 ONLINE 0 0 0
>> c12t600A0B800029E5EA075F4685132Ed0 ONLINE 0 0 0
>> c12t600A0B800029E5EA076546851342d0 ONLINE 0 0 0
>> 
>> 
>> Thanks,
>> 
>> David
>>  
>>  
>> This message posted from opensolaris.org
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>___
>zfs-discuss mailing list
>zfs-discuss@opensolaris.org
>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] incorrect/conflicting suggestion in error message on a faulted pool

2008-04-07 Thread Haudy Kazemi
Hello,

I'm writing to report what I think is an incorrect or conflicting 
suggestion in the error message displayed on a faulted pool that does 
not have redundancy (equiv to RAID0?).  I ran across this while testing 
and learning about ZFS on a clean installation of NexentaCore 1.0.

Here is how to recreate the scenario:

[EMAIL PROTECTED]:~$ mkfile 200m testdisk1 testdisk2
[EMAIL PROTECTED]:~$ sudo zpool create mybigpool $PWD/testdisk1 $PWD/testdisk2
Password:
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
mybigpool ONLINE   0 0 0
  /export/home/kaz/testdisk1  ONLINE   0 0 0
  /export/home/kaz/testdisk2  ONLINE   0 0 0

errors: No known data errors
[EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
config:

NAME  STATE READ WRITE CKSUM
mybigpool ONLINE   0 0 0
  /export/home/kaz/testdisk1  ONLINE   0 0 0
  /export/home/kaz/testdisk2  ONLINE   0 0 0

errors: No known data errors

Up to here everything looks fine.  Now lets destroy one of the virtual 
drives:

[EMAIL PROTECTED]:~$ rm testdisk2
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
config:

NAME  STATE READ WRITE CKSUM
mybigpool ONLINE   0 0 0
  /export/home/kaz/testdisk1  ONLINE   0 0 0
  /export/home/kaz/testdisk2  ONLINE   0 0 0

errors: No known data errors

Okay, still looks fine, but I haven't tried to read/write to it yet.  
Try a scrub.

[EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: FAULTED
status: One or more devices could not be opened.  Sufficient replicas 
exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:10:36 2008
config:

NAME  STATE READ WRITE CKSUM
mybigpool FAULTED  0 0 0  
insufficient replicas
  /export/home/kaz/testdisk1  ONLINE   0 0 0
  /export/home/kaz/testdisk2  UNAVAIL  0 0 0  cannot 
open

errors: No known data errors
[EMAIL PROTECTED]:~$

There we go.  The pool has faulted as I expected to happen because I 
created it as a non-redundant pool.  I think it was the equivalent of a 
RAID0 pool with checksumming, at least it behaves like one.  The key to 
my reporting this is that the "status" message says "One or more devices 
could not be opened.  Sufficient replicas exist for the pool to continue 
functioning in a degraded state." while the message further down to the 
right of the pool name says "insufficient replicas".
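
(For contrast, creating the same pool as a mirror rather than a stripe would make 
that verbose text accurate, since removing one file would only degrade the pool 
instead of faulting it; a sketch using the same test files:)

sudo zpool create mybigpool mirror $PWD/testdisk1 $PWD/testdisk2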

The verbose status message is wrong in this case.  From other forum/list 
posts it looks like that status message is also used for degraded pools, 
which isn't a problem, but here we have a faulted pool.  Here's an 
example of the same status message used appropriately: 
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/031298.html

Is anyone else able to reproduce this?  And if so, is there a ZFS bug 
tracker to report this too? (I didn't see a public bug tracker when I 
looked.)

Thanks,

Haudy Kazemi
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] incorrect/conflicting suggestion in error message on a faulted pool

2008-04-09 Thread Haudy Kazemi
I have reported this bug here: 
http://bugs.opensolaris.org/view_bug.do?bug_id=6685676

I think this bug may be related, but I do not see where to add a note to 
an existing bug report: 
http://bugs.opensolaris.org/view_bug.do?bug_id=6633592
(both bugs refer to ZFS-8000-2Q however my report shows a FAULTED pool 
instead of a DEGRADED pool.)

Thanks,

-hk

Haudy Kazemi wrote:
> Hello,
>
> I'm writing to report what I think is an incorrect or conflicting 
> suggestion in the error message displayed on a faulted pool that does 
> not have redundancy (equiv to RAID0?).  I ran across this while testing 
> and learning about ZFS on a clean installation of NexentaCore 1.0.
>
> Here is how to recreate the scenario:
>
> [EMAIL PROTECTED]:~$ mkfile 200m testdisk1 testdisk2
> [EMAIL PROTECTED]:~$ sudo zpool create mybigpool $PWD/testdisk1 $PWD/testdisk2
> Password:
> [EMAIL PROTECTED]:~$ zpool status mybigpool
>   pool: mybigpool
>  state: ONLINE
>  scrub: none requested
> config:
>
> NAME  STATE READ WRITE CKSUM
> mybigpool ONLINE   0 0 0
>   /export/home/kaz/testdisk1  ONLINE   0 0 0
>   /export/home/kaz/testdisk2  ONLINE   0 0 0
>
> errors: No known data errors
> [EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
> [EMAIL PROTECTED]:~$ zpool status mybigpool
>   pool: mybigpool
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
> config:
>
> NAME  STATE READ WRITE CKSUM
> mybigpool ONLINE   0 0 0
>   /export/home/kaz/testdisk1  ONLINE   0 0 0
>   /export/home/kaz/testdisk2  ONLINE   0 0 0
>
> errors: No known data errors
>
> Up to here everything looks fine.  Now lets destroy one of the virtual 
> drives:
>
> [EMAIL PROTECTED]:~$ rm testdisk2
> [EMAIL PROTECTED]:~$ zpool status mybigpool
>   pool: mybigpool
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:09:29 2008
> config:
>
> NAME  STATE READ WRITE CKSUM
> mybigpool ONLINE   0 0 0
>   /export/home/kaz/testdisk1  ONLINE   0 0 0
>   /export/home/kaz/testdisk2  ONLINE   0 0 0
>
> errors: No known data errors
>
> Okay, still looks fine, but I haven't tried to read/write to it yet.  
> Try a scrub.
>
> [EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
> [EMAIL PROTECTED]:~$ zpool status mybigpool
>   pool: mybigpool
>  state: FAULTED
> status: One or more devices could not be opened.  Sufficient replicas 
> exist for
> the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>see: http://www.sun.com/msg/ZFS-8000-2Q
>  scrub: scrub completed after 0h0m with 0 errors on Mon Apr  7 22:10:36 2008
> config:
>
> NAME  STATE READ WRITE CKSUM
> mybigpool FAULTED  0 0 0  
> insufficient replicas
>   /export/home/kaz/testdisk1  ONLINE   0 0 0
>   /export/home/kaz/testdisk2  UNAVAIL  0 0 0  cannot 
> open
>
> errors: No known data errors
> [EMAIL PROTECTED]:~$
>
> There we go.  The pool has faulted as I expected to happen because I 
> created it as a non-redundant pool.  I think it was the equivalent of a 
> RAID0 pool with checksumming, at least it behaves like one.  The key to 
> my reporting this is that the "status" message says "One or more devices 
> could not be opened.  Sufficient replicas exist for the pool to continue 
> functioning in a degraded state." while the message further down to the 
> right of the pool name says "insufficient replicas".
>
> The verbose status message is wrong in this case.  From other forum/list 
> posts it looks like that status message is also used for degraded pools, 
> which isn't a problem, but here we have a faulted pool.  Here's an 
> example of the same status message used appropriately: 
> http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/031298.html
>
> Is anyone else able to reproduce this?  And if so, is there a ZFS bug 
> tracker to report this to? (I didn't see a public bug tracker when I 
> looked.)
>
> Thanks,
>
> Haudy Kazemi
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   
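
For anyone who wants to try this quickly, here is a minimal consolidated sketch of the reproduction steps quoted above (same 200 MB file-backed vdevs and pool name as in the transcript; it assumes a writable home directory and that destroying the test pool afterwards is acceptable):

# Reproduction sketch for the conflicting ZFS-8000-2Q message on a non-redundant pool.
cd $HOME
mkfile 200m testdisk1 testdisk2                              # two file-backed "disks"
sudo zpool create mybigpool $PWD/testdisk1 $PWD/testdisk2    # striped, no redundancy
sudo zpool scrub mybigpool
zpool status mybigpool                                       # pool reports ONLINE
rm testdisk2                                                 # simulate losing one device
sudo zpool scrub mybigpool
zpool status mybigpool                                       # pool now reports FAULTED
# Compare the verbose "Sufficient replicas exist" status text with the
# "insufficient replicas" note printed next to the pool name.
# Cleanup: sudo zpool destroy mybigpool; rm -f testdisk1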

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diverse, Dispersed, Distributed, Unscheduled RAID volumes

2008-04-25 Thread Haudy Kazemi

Brandon High wrote:

On Fri, Apr 25, 2008 at 4:48 AM, kilamanjaro <[EMAIL PROTECTED]> wrote:
  

Is ZFS ready today to link a set of dispersed desktop computers (diverse
operating systems) into a distributed RAID volume that supports desktops



It sounds like you'd want to use something like Lustre or Hadoop, both
of which are only supported on Linux.

I remember there being an application in the Windows 95/98 timeframe
that did what you want, but have no idea what it was called, how well it
worked, or if it still exists.

-B

  
I remember that Win95/98 tool as well.  It was advertised in the product 
catalogs of the day (~1996/1997/1998) like TigerDirect or PCConnection, 
etc.  It let you create a shared storage drive that used space on 
multiple computers on the network, without needing all the PCs to be on 
at the same time.  FWIW, the Google File System uses a policy of simply 
keeping 3 copies of everything.  The suggested scenario should 
definitely use multiple-drive mirror vdevs, and if they're built from 
files on CIFS/SMB shares, the iSCSI delay mentioned in this thread may 
not occur.
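
To make that concrete, here is a hedged sketch of what such a layout might look like on OpenSolaris; the hostnames pc1/pc2, share names, mount points, and file sizes are all made up for illustration, and behaviour of file vdevs over smbfs is untested:

# Mount a share from each of two desktops using the Solaris smbfs client.
mkdir -p /mnt/pc1 /mnt/pc2
mount -F smbfs //user@pc1/share /mnt/pc1
mount -F smbfs //user@pc2/share /mnt/pc2
# Create equal-sized backing files on each remote share.
mkfile 100g /mnt/pc1/vdev0
mkfile 100g /mnt/pc2/vdev0
# Mirror the two file-backed vdevs so either desktop can drop off the
# network without losing the pool (throughput is bounded by the LAN).
zpool create netpool mirror /mnt/pc1/vdev0 /mnt/pc2/vdev0

Whether ZFS stays happy with file vdevs living on smbfs mounts over time is exactly the sort of thing that would need testing before trusting it with real data.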


-hk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diverse, Dispersed, Distributed, Unscheduled RAID volumes

2008-04-25 Thread Haudy Kazemi

Brandon High wrote:

On Fri, Apr 25, 2008 at 4:48 AM, kilamanjaro <[EMAIL PROTECTED]> wrote:
  

Is ZFS ready today to link a set of dispersed desktop computers (diverse
operating systems) into a distributed RAID volume that supports desktops



It sounds like you'd want to use something like Lustre or Hadoop, both
of which are only supported on Linux.

I remember there being an application in the Windows 95/98 timeframe
that did what you want, but have no idea what it was called, how well it
worked, or if it still exists.

-B

  
I did some searching and found the product you and I were thinking of.  
It was called Medley97 and versions definitely existed for Windows 95 
and NT.  It let you pool storage from many desktops, and attempted to 
always keep at least two copies of something available online.
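
As a rough ZFS-side analogue to that "always keep at least two copies" behaviour (within a single pool, not spread across desktops), the copies property can be set per filesystem; a small hedged example with a made-up pool/dataset name:

# Store two copies of every data block in this filesystem.
# This protects against localized corruption, not whole-device loss.
zfs create -o copies=2 tank/shared
zfs get copies tank/shared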


Here's info on it and some other similar software (see the Hadoop 
compatibility and Lustre w/ZFS notes too):


Medley97
"Medley97 is a virtually zero-administration, plug-and-play network 
operating system that creates a pooled network drive and disk cache from 
unused disk space and free memory on workstations. Available: Now (well 
if you live in 1997, that is). $695 per server MangoSoft Corp. (888) 
88-MANGO; fax (508) 898-9166 [EMAIL PROTECTED] or www.mango.com"

http://ask.slashdot.org/comments.pl?sid=447752&threshold=1&commentsort=0&mode=thread&cid=22367030
Medley97 takes distributed computing a step further
http://www.mangosoft.com/news/pa/pa_0002_-_INFOWorld_-_distributed_computing.asp
Mango's Medley97 Achieves Best of COMDEX Honors
http://www.mangosoft.com/news/pr/19971124.asp
Mango pooling is the biggest idea we've seen since network computers
http://www.mangosoft.com/news/pa/pa_0009_-_infoworld_-_mango_pooling.asp
http://www.networkcomputing.com/902/902ff.html
MojoNation ... Corporate Backup Tool?
http://developers.slashdot.org/article.pl?no_d2=1&sid=02/07/18/0244256
"Mango Medley 97 did this 5 years ago"
http://developers.slashdot.org/comments.pl?sid=36257&threshold=1&commentsort=0&mode=thread&no_d2=1&cid=3907959

FreeLoader : Scavenged Distributed Storage System
http://www.ece.ubc.ca/~samera/projects/freeloader/
http://www.csm.ornl.gov/~vazhkuda/Morsels/

vanDisk: An Exploration in Peer-To-Peer Collaborative Back-Up Storage
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/4232658/4232659/04232719.pdf?tp=&isnumber=&arnumber=4232719
http://www.ece.ubc.ca/~matei/496/OldProjects/2007.04-vanDisk-ArminBahramshahry.pdf

Making Use of Terabytes of Unused Storage
http://ask.slashdot.org/article.pl?sid=08/02/09/1319258

Hadoop
http://hadoop.apache.org/core/
http://wiki.apache.org/hadoop/ProjectDescription
1. What is Hadoop? 
Hadoop is a distributed computing platform written in Java. It 
incorporates features similar to those of the Google File System and 
of MapReduce. For some details, see HadoopMapReduce.

2. What platform does Hadoop run on?
Java 1.5.x or higher, preferably from Sun
Linux and Windows are the supported operating systems, but BSD and Mac 
OS/X are known to work. (Windows requires the installation of Cygwin).


Lustre
http://wiki.lustre.org/index.php?title=Main_Page
"Lustre is a scalable, secure, robust, highly-available cluster file 
system. It is designed, developed and maintained by Sun Microsystems, Inc. "
"Even before the acquisition, Sun declared its intentions to marry 
Lustre to its own ZFS file system to produce a general-purpose, 
high-capacity parallel file system solution. ZFS is Sun's Solaris-based 
file system for applications that require very large storage capacity. 
For true scalability, the only element missing was a clustering 
capability, which they now have in Lustre."

http://www.hpcwire.com/blogs/17903424.html
"Lustre 1.8 will allow users to choose between ZFS and ldiskfs as 
back-end storage."

http://en.wikipedia.org/wiki/Lustre_(file_system)#ZFS_integration
Lustre to run on ZFS (3/26/2008)
"The development teams hopes to get a version of the ZFS-compatible 
Lustre released by the end of the year. "

http://www.gcn.com/online/vol1_no1/46011-1.html

-hk

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss