Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Eff Norwood
> 
> We tried all combinations of OCZ SSDs including their PCI based SSDs and
> they do NOT work as a ZIL. After a very short time performance degrades
> horribly and for the OCZ drives they eventually fail completely. 

This was something interesting I found recently.  Apparently, for flash
manufacturers, flash hard drives are like the pimple on the butt of the
elephant.  The vast majority of the flash production in the world goes into
devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
into hard drives.  As a result, they optimize for those other devices, and
one of the important side effects is that standard flash chips use an 8K
page size, but hard drives use either 4K or 512B.

The SSD controller secretly remaps blocks internally and aggregates small
writes into a single 8K write, so there's really no way for the OS to know
whether it's writing to a 4K block which happens to share an 8K page with
another 4K block.  So it's unavoidable, and whenever it happens, the drive
can't simply write.  It must read-modify-write, which is obviously much
slower.
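
As far as I know there's no way to query the flash page size from the host,
but you can at least check what sector size the drive claims and what write
alignment ZFS chose for it.  Something like this (device and pool names are
just examples):

  # what the drive reports as its sector size
  prtvtoc /dev/rdsk/c8t1d0s0 | grep "bytes/sector"
  # what alignment ZFS picked for the vdev (ashift=9 means 512B, ashift=12 means 4K)
  zdb -C tank | grep ashift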

Also, if you look up the specs of an SSD, both for IOPS and/or sustainable
throughput...  They lie.  Well, technically they're not lying, because
technically it is *possible* to reach whatever they say, provided you
optimize your usage patterns and only use blank drives which are new from
the box or have been fully TRIM'd.  But in my experience, reality is about
50% of whatever they say.

Presently, the only way to deal with all this is the TRIM command, which
cannot eliminate the read-modify-write but can reduce its occurrence.  Make
sure your OS supports TRIM.  I'm not sure at what point ZFS added TRIM, or
to what extent...  I can't really measure the effectiveness myself.

Long story short, in the real world, you can expect the DDRDrive to crush
and shame the performance of any SSD you can find.  It's mostly a question
of PCIe slot versus SAS/SATA slot, and other characteristics you might care
about, like external power, etc.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best choice - file system for system

2011-01-28 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Tristram Scott
> 
> When it comes to dumping and restoring filesystems, there is still no
> official replacement for the ufsdump and ufsrestore.

Let's go into that a little bit.  If you're piping zfs send directly into
zfs receive, then it is an ideal backup method.  But not everybody can
afford the disk necessary to do that, so people are tempted to "zfs send" to
a file or tape.  There are precisely two reasons why that's not "officially"
recommended:
1- When you want to restore, it's all or nothing.  You can't selectively
restore a single file.
2- When you want to restore, it's all or nothing.  If a single bit is
corrupt in the data stream, the whole stream is lost.

Regarding point #2, I contend that zfs send is better than ufsdump.  I would
prefer to discover corruption in the backup rather than blindly restoring it
undetected.  Also, since the introduction of zstreamdump, you are able to
detect any corruption during stream generation...  And you are able to
verify the integrity of a stream after it is written to its destination.
All of this serves to minimize the importance of point #2.
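
For example, if you must send to a file, something along these lines (the
snapshot and file names are made up) catches a bad stream before you ever
need it:

  zfs send tank/home@weekly | tee /backup/home-weekly.zfs | zstreamdump
  # ...and later, to re-check the copy sitting on the backup media:
  zstreamdump < /backup/home-weekly.zfs

zstreamdump walks the stream and should complain if the embedded checksums
don't match, so either step will flag corruption.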

Regarding point #1, I'll agree ufsdump has an advantage, which is the
ability to do a selective restore.  Again, ZFS does have an answer to this,
which is to pipe the send directly into a receive.  Not always possible, but
that's the answer.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Deano
Hi Edward,
Do you have a source for the 8KiB block size data?  Whilst we can't avoid
the SSD controller, in theory we can change the smallest size we present to
the SSD to 8KiB fairly easily...  I wonder if that would help the controller
do a better job (especially with TRIM).

I might have to do some tests.  So far the assumption (even inside Sun's sd
driver) is that SSDs are really 4KiB even when they claim 512B; perhaps we
should have an 8KiB option...

Thanks,
Deano
de...@cloudpixies.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best choice - file system for system

2011-01-28 Thread Evaldas Auryla

 On 01/28/11 02:37 PM, Edward Ned Harvey wrote:

> Let's go into that a little bit.  If you're piping zfs send directly into
> zfs receive, then it is an ideal backup method.  But not everybody can
> afford the disk necessary to do that, so people are tempted to "zfs send"
> to a file or tape.  There are precisely two reasons why that's not
> "officially" recommended:
> 1- When you want to restore, it's all or nothing.  You can't selectively
> restore a single file.
> 2- When you want to restore, it's all or nothing.  If a single bit is
> corrupt in the data stream, the whole stream is lost.
>
> Regarding point #2, I contend that zfs send is better than ufsdump.  I
> would prefer to discover corruption in the backup, rather than blindly
> restoring it undetected.  Also, since the invention of zstreamdump, you are
> able to detect any corruption during stream generation...  And you are able
> to verify integrity of a stream after it is written to its destination.
> All of this serves to minimize the importance of point #2.

Hi,

Be careful with zstreamdump: it has a bug, at least in build 134, and I
see the related CR is still open
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6933259).


Regards,

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best choice - file system for system

2011-01-28 Thread Darren J Moffat

On 28/01/2011 13:37, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Tristram Scott
>>
>> When it comes to dumping and restoring filesystems, there is still no
>> official replacement for the ufsdump and ufsrestore.
>
> Let's go into that a little bit.  If you're piping zfs send directly into
> zfs receive, then it is an ideal backup method.  But not everybody can
> afford the disk necessary to do that, so people are tempted to "zfs send" to
> a file or tape.  There are precisely two reasons why that's not "officially"
> recommended:


"Officially" yes you have it in quotes but where is the official 
reference for this ?


In fact, I'd say the opposite.  In Solaris 11 Express the NDMP daemon can
back up using dump, tar or a zfs send stream.


This is also what the 'Sun ZFS Storage Appliance' does; see here:

http://www.oracle.com/technetwork/articles/systems-hardware-architecture/ndmp-whitepaper-192164.pdf

See page 8 of the PDF, in the section titled "About ZFS-NDMP Backup Support".

It does point out, though, that it works with full ZFS datasets only, but
incremental backup and incremental restore are supported.


This has been tested and is known to work with at least the following 
backup applications:


• Oracle Secure Backup 10.3.0.2 and above
• Enterprise Backup Software (EBS) / Legato Networker 7.5 and above
• Symantec NetBackup 6.5.3 and above


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] dedup experience with sufficient RAM/l2arc/cpu

2011-01-28 Thread Ware Adams
There's a lot of discussion of dedup performance issues (including problems
backing out of using it, which concerns me), but many/most of those involve
relatively limited RAM and CPU configurations.  I wanted to see if there is
experience people could share using it with higher RAM levels and L2ARC.

We have built a backup storage server nearly identical to this:

http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/

briefly:

SuperMicro 36 bay case
48 GB RAM
2x 5620 CPU
Hitachi A7K2000 drives for storage
X25-M for l2arc (160 GB)
4x LSI SAS9211-8i
Solaris 11 Express

The main storage pool is mirrored and uses gzip compression.  Our use consists
of backing up daily snapshots of multiple MySQL hosts from a Sun 7410
appliance.  We rsync the snapshot to the backup server (ZFS send to a
non-appliance host isn't supported on the 7000, unfortunately), snapshot (so
now we have a snapshot that matches the original on the 7410), clone, start
MySQL on the clone to verify the backup, then shut down MySQL.  We do this
daily across 10 hosts which have significant overlap in data.
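
In rough shell terms (the pool, dataset, and mount names here are just
placeholders, not our real layout), each daily cycle is something like:

  rsync -a /net/7410/export/mysql01/ /backup/mysql01/
  zfs snapshot backup/mysql01@20110128
  zfs clone backup/mysql01@20110128 backup/mysql01_verify
  # start MySQL against the clone, verify the backup, shut MySQL down,
  # then: zfs destroy backup/mysql01_verify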

I might guess that dedup would provide good space savings, but before I turn it
on I wanted to see if people with larger configurations had found it workable.
My greatest concern is the stories of not only poor performance but, worse,
complete non-responsiveness when trying to zfs destroy a filesystem with dedup
turned on.

We are somewhat flexible here.  We are not terribly pressed for space, and we 
do not need massive performance out of this.  Because of that I probably won't 
use dedup without hearing it is workable on a similar configuration, but if 
people have had success it would give us more cushion for inevitable data 
growth.

Thanks for any help,
Ware
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread taemun
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> This was something interesting I found recently.  Apparently for flash
> manufacturers, flash hard drives are like the pimple on the butt of the
> elephant. A vast majority of the flash production in the world goes into
> devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
> into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (first thing that Google turned
up). Not denying what you said, I just like real figures rather than random
hearsay.


> As a result, they optimize for these other devices, and
> one of the important side effects is that standard flash chips use an 8K
> page size.  But hard drives use either 4K or 512B.
>
http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: "page" means the smallest data size that can be read or programmed
(written). "Block" means the smallest data size that can be erased. SSDs
commonly have a page size of 4KiB and a block size of 512KiB. I'd take
Anandtech's word on it.

There is probably some variance across the market, but for the vast
majority, this is true. Wikipedia's
http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common
page sizes are 512B, 2KiB, and 4KiB.

> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page.  So it's unavoidable, and whenever it happens, the
> drive can't simply write.  It must read modify write, which is obviously
> much slower.
>
This may be true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't
exist.  As for writing when everything is full and you need to do an erase:
well, this is where TRIM is helpful.

> Also if you look up the specs of a SSD, both for IOPS and/or sustainable
> throughput...  They lie.  Well, technically they're not lying because
> technically it is *possible* to reach whatever they say.  Optimize your
> usage patterns and only use blank drives which are new from box, or have
> been fully TRIM'd.  Pt...  But in my experience, reality is about 50%
> of
> whatever they say.
>
> Presently, the only way to deal with all this is via the TRIM command,
> which
> cannot eliminate the read/modify/write, but can reduce their occurrence.
> Make sure your OS supports TRIM.  I'm not sure at what point ZFS added
> TRIM,
> or to what extent...  Can't really measure the effectiveness myself.
>
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655


> Long story short, in the real world, you can expect the DDRDrive to crush
> and shame the performance of any SSD you can find.  It's mostly a question
> of PCIe slot versus SAS/SATA slot, and other characteristics you might care
> about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a
surprising result.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS root clone problem

2011-01-28 Thread alex bartonek
(for some reason I cannot find  my original thread..so I'm reposting it)

I am trying to move my data off of a 40GB 3.5" drive to a 40GB 2.5" drive.
This is in a Netra running Solaris 10.

Originally what I did was:

zpool attach -f rpool c0t0d0 c0t2d0.

Then I did an installboot on c0t2d0s0.

Didn't work.  I was not able to boot from my second drive (c0t2d0).

I cannot remember my other commands but I ended up removing c0t2d0 from my 
pool.  So here is how it looks now:

# zpool status -v
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  c0t0d0s0  ONLINE       0     0     0

zfs list shows no other drive connected to the pool.

I am trying to redo this to see where I went wrong but I get the following 
error:
zpool attach -f rpool c0t0d0 c0t2d0


# zpool attach -f rpool c0t0d0 c0t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
/dev/dsk/c0t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M).


How can I remove c0t2d0 from the pool?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS root clone problem

2011-01-28 Thread Cindy Swearingen

Hi Alex,

Disks that are part of the root pool must contain a valid
slice 0 (this is a boot restriction), and the disk names that you
present to ZFS for the root pool must also specify the slice
identifier (s0).  For example, instead of this syntax:

# zpool attach -f rpool c0t0d0 c0t2d0

try this syntax:

# zpool attach rpool c0t0d0s0 c0t2d0s0

Then, apply the boot blocks to c0t2d0s0.
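
On a SPARC system like the Netra, that step is usually something like the
following (the bootblk path shown is the standard Solaris 10 location;
double-check it on your release):

# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t2d0s0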

This issue was a bug in previous releases:

 # zpool attach -f rpool c0t0d0 c0t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
/dev/dsk/c0t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M).

To workaround this bug, try this syntax:

# zpool attach -f rpool c0t0d0s0 c0t2d0s0

Thanks,

Cindy

On 01/27/11 19:18, alex bartonek wrote:

(for some reason I cannot find  my original thread..so I'm reposting it)

I am trying to move my data off of a 40gb 3.5" drive to a 40gb 2.5" drive.  
This is in a Netra running Solaris 10.

Originally what I did was:

zpool attach -f rpool c0t0d0 c0t2d0.

Then I did an installboot on c0t2d0s0.

Didnt work.  I was not able to boot from my second drive (c0t2d0).

I cannot remember my other commands but I ended up removing c0t2d0 from my 
pool.  So here is how it looks now:

# zpool status -v
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  c0t0d0s0  ONLINE       0     0     0

zfs list shows no other drive connected to the pool.

I am trying to redo this to see where I went wrong but I get the following 
error:
zpool attach -f rpool c0t0d0 c0t2d0


# zpool attach -f rpool c0t0d0 c0t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
/dev/dsk/c0t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M).


How can I remove c0t2d0 from the pool?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best choice - file system for system

2011-01-28 Thread Richard Elling
On Jan 27, 2011, at 4:34 AM, Tristram Scott wrote:

> I don't disagree that zfs is the better choice, but...
> 
>> Seriously though.  UFS is dead.  It has no advantage over ZFS that I'm
>> aware of.
>>
> 
> When it comes to dumping and restoring filesystems, there is still no official
> replacement for the ufsdump and ufsrestore.  The discussion has been had
> before, but to my knowledge, there is no consensus on the best method for
> backing up zfs filesystems.

ufsrestore works fine on ZFS :-)

But seriously, this is why we wrote the section in the ZFS Best Practices Guide
talking about traditional backup/restore.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Using_ZFS_With_Enterprise_Backup_Solutions
Updates are graciously appreciated.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup experience with sufficient RAM/l2arc/cpu

2011-01-28 Thread Richard Elling
comment below...

On Jan 28, 2011, at 7:13 AM, Ware Adams wrote:

> There's a lot of discussion of dedup performance issues (including problems 
> backing out of using it which concerns me), but many/most of those involve 
> relatively limited RAM and CPU configurations.  I wanted to see if there is 
> experience that people could share using it on with higher RAM levels and 
> l2arc.
> 
> We have built a backup storage server nearly identical to this:
> 
> http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/
> 
> briefly:
> 
> SuperMicro 36 bay case
> 48 GB RAM
> 2x 5620 CPU
> Hitachi A7K2000 drives for storage
> X25-M for l2arc (160 GB)
> 4x LSI SAS9211-8i
> Solaris 11 Express
> 
> The main storage pool is mirrored and uses gzip compression.  Our use 
> consists of backing up daily snapshots of multiple MySQL hosts from a Sun 
> 7410 appliance.  We rsync the snapshot to the backup server (ZFS send to 
> non-appliance host isn't supported on the 7000 unfortunately), snapshot (so 
> now we have a snapshot of that matches the original on the 7410), clone, 
> start MySQL on the clone to verify the backup, shut down MySQL.  We do this 
> daily across 10 hosts which have significant overlap in data.
> 
> I might guess that dedup would provide good space savings, but before I turn 
> it on I wanted to see if people with larger configurations had found it 
> workable.  My greatest concern are stories of not only poor performance but 
> worse complete non-responsiveness when trying to zfs destroy a filesystem 
> with dedup turned on.
> 
> We are somewhat flexible here.  We are not terribly pressed for space, and we 
> do not need massive performance out of this.  Because of that I probably 
> won't use dedup without hearing it is workable on a similar configuration, 
> but if people have had success it would give us more cushion for inevitable 
> data growth.

I apologize for the shortness, but since you have such large, slow drives,
rather than making a single huge pool and deduping, create a pool per
month/week/quarter.  Send over the snaps that you need, then destroy the old
pool.  KISS & fast destroy.
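
Something along these lines (pool, device, and snapshot names are only
placeholders):

# zpool create backup2011q1 mirror c9t0d0 c9t1d0
# zfs send -R tank/mysql@20110331 | zfs receive -d backup2011q1
# zpool destroy backup2010q4

Destroying a whole pool is quick, whereas destroying a large deduped dataset
is exactly where people get burned.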
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup experience with sufficient RAM/l2arc/cpu

2011-01-28 Thread Ware Adams
On Jan 28, 2011, at 12:21 PM, Richard Elling wrote:
> 
> On Jan 28, 2011, at 7:13 AM, Ware Adams wrote:
> 
>> SuperMicro 36 bay case
>> 48 GB RAM
>> 2x 5620 CPU
>> Hitachi A7K2000 drives for storage
>> X25-M for l2arc (160 GB)
>> 4x LSI SAS9211-8i
>> Solaris 11 Express
> 
> I apologize for the shortness, but since you have such large, slow drives, 
> rather than making
> a single huge pool and deduping, create a pool per month/week/quarter. Send 
> the snaps over
> that you need, destroy the old pool. KISS & fast destroy.

I hadn't thought about that, but I think it might add its own complexity.  Some 
more detail on what we are doing:

This host is a backup storage server for (currently) six MySQL hosts (whose
data sets reside on NFS shares exported from the 7410).  Each data set is ~1.5
TB uncompressed.  Of this, about 30 GB changes per day (that's the rsync'd
amount; ZFS send -i would be less, but I can't do that from the 7410).  We are
getting about 3.6:1 compression using gzip.

Then we are keeping daily backups for a month, weeklies for 6 months, and
monthlies for a year.  By far our most frequent use of backups is an
accidentally dropped table, but with some frequency we also need to recover
from a situation where a user's code error was writing garbage to a field for,
say, a month and they need to recover as of a certain date several months ago.
So, all in all, we would like to keep quite a number of backups, say 6 hosts *
(30 dailies + 20 weeklies + 6 monthlies) = 336.  The dailies and weeklies get
pruned as they age into later time periods and aren't needed (and all are
pruned after a year).

With the above I'd be able to have 18 pools with mirrors or 36 pools with just
single drives.  So there are two things that would seem to add complexity.
First, I'd have to assign each incoming snapshot from the 7410 to one of those
pools based on whether it is going to expire or not.  I assume you could live
with 18 or 36 "slots", but I haven't done the logic to find out exactly.
Still, it would be some added complexity vs. today's process, which is
basically:

rsync from 7410
snapshot
clone

The other issue is the rsync step.  With only one pool I just rsync the 30 GB
of changed data to that MySQL host's share.  In the multiple-pool scenarios I
guess I would have a base copy of the full data set per pool?  That would eat
up ~400 GB on each 2 TB pool, so I wouldn't be able to fit all 6 hosts onto a
given pool.

We haven't done a lot of zfs destroy yet (though some in testing), so I can't
say whether the current setup is workable.  But unless it is horribly slow,
there does seem to be some simplicity benefit from having a single pool.  I'll
keep this in mind, though.  We could probably have a larger pool for the 6
dailies per week that will be destroyed.  I'd still have to zfs send the base
directory prior to rsync, but that would simplify things some.

Thanks for the suggestion.

--Ware
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Eric D. Mudama

On Fri, Jan 28 at  8:25, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Eff Norwood
>>
>> We tried all combinations of OCZ SSDs including their PCI based SSDs and
>> they do NOT work as a ZIL. After a very short time performance degrades
>> horribly and for the OCZ drives they eventually fail completely.
>
> This was something interesting I found recently.  Apparently for flash
> manufacturers, flash hard drives are like the pimple on the butt of the
> elephant. A vast majority of the flash production in the world goes into
> devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
> into hard drives.  As a result, they optimize for these other devices, and
> one of the important side effects is that standard flash chips use an 8K
> page size.  But hard drives use either 4K or 512B.
>
> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page.  So it's unavoidable, and whenever it happens, the
> drive can't simply write.  It must read modify write, which is obviously
> much slower.


The reality is way more complicated, and statements like the above may
or may not be true on a vendor-by-vendor basis.

As time passes, the underlying NAND geometries are designed for
certain sets of advantages, continually subject to re-evaluation and
modification, and good SSD controllers on the top of NAND or other
solid-state storage will map those advantages effectively into our
problem domains as users.

Testing methodologies are improving over time as well, and eventually
it will be more clear which devices are suited to which tasks.

The suitability of a specific solution into a problem space will
always be a balance between cost, performance, reliability and time to
market.  No single solution (RAM SAN, RAM SSD, NAND SSD, BBU
controllers, rotating HDD, etc.) wins in every single area, or else we
wouldn't be having this discussion.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Dedup question

2011-01-28 Thread Igor P
I created a zfs pool with dedup with the following settings:
zpool create data c8t1d0
zfs create data/shared
zfs set dedup=on data/shared

The thing I was wondering about is that it seems like ZFS only dedups at the
file level and not the block level.  When I make multiple copies of a file to
the store I see an increase in the dedup ratio, but when I copy similar files
the ratio stays at 1.00x.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Dedup question

2011-01-28 Thread Nicolas Williams
On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote:
> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
> 
> The thing I was wondering about was it seems like ZFS only dedup at
> the file level and not the block. When I make multiple copies of a
> file to the store I see an increase in the deup ratio, but when I copy
> similar files the ratio stays at 1.00x.

Dedup is done at the block level, not file level.  "Similar files" does
not mean that they actually share common blocks.  You'll have to look
more closely to determine if they do.
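
If you want to look without turning anything on, zdb can simulate dedup over
the data already in the pool (it walks everything, so it can take a while):

  # zdb -S data

The summary it prints at the end shows the dedup ratio ZFS would actually get
for those blocks.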

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Dedup question

2011-01-28 Thread Jeff Savit

 On 01/28/11 02:38 PM, Igor P wrote:

> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
>
> The thing I was wondering about was it seems like ZFS only dedup at the file
> level and not the block. When I make multiple copies of a file to the store I
> see an increase in the deup ratio, but when I copy similar files the ratio
> stays at 1.00x.
Igor, ZFS does indeed perform dedup at the block level.  Identical files
have identical blocks, of course, but "similar" files may have
differences (data inserted, deleted or changed) such that each block
is different.  The same data has to be on the same block alignment to have
duplicate blocks.  Also, it's important to have lots of RAM or high-speed
devices to quickly access metadata, or removing data will take a lot of
time, so please use appropriately sized systems.  That's been discussed a
lot on this list.
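
A quick way to see the alignment point (the file names below are made up):
take a file already in the pool, make one exact copy and one copy with a
single byte prepended.  The exact copy dedups completely; the shifted copy
doesn't dedup at all, because every one of its blocks now differs from the
original's.

  cp /data/shared/bigfile /data/shared/bigfile.copy
  (printf 'X'; cat /data/shared/bigfile) > /data/shared/bigfile.shifted
  zpool list data    # the DEDUP column moves for the copy, not the shifted file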


See Jeff Bonwick's blog for a very good description: 
http://blogs.sun.com/bonwick/entry/zfs_dedup


I hope that's helpful,
  Jeff (a different Jeff)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Dedup question

2011-01-28 Thread Freddie Cash
On Fri, Jan 28, 2011 at 1:38 PM, Igor P  wrote:
> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
>
> The thing I was wondering about was it seems like ZFS only dedup at the file 
> level and not the block. When I make multiple copies of a file to the store I 
> see an increase in the deup ratio, but when I copy similar files the ratio 
> stays at 1.00x.

Easiest way to test it is to create a 10 MB file full of random data:
  $ dd if=/dev/random of=random.10M bs=1M count=10

Copy that to the pool a few times under different names to watch the
dedupe ratio increase, basically linearly.

Then open the file in a text editor and change the last few lines of
the file.  Copy that to the pool a few times under new names.  Watch
the dedupe ratio increase, but not linearly, as the last block or three
of the file will be different.

Repeat changing different lines in the file, and watch as disk usage
only increases a little, since the files still "share" (or have in
common) a lot of blocks.

ZFS dedupe happens at the block layer, not the file layer.
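
For example, re-using the data/shared filesystem from the original post (the
copy names are arbitrary):

  $ cp random.10M /data/shared/copy1
  $ cp random.10M /data/shared/copy2
  $ cp random.10M /data/shared/copy3
  $ zpool get dedupratio data

With nothing else on the pool, the ratio should come out around 3.00x, since
copies 2 and 3 are entirely duplicate blocks of copy 1.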


-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS not usable (was ZFS Dedup question)

2011-01-28 Thread Roy Sigurd Karlsbakk
> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
> 
> The thing I was wondering about was it seems like ZFS only dedup at
> the file level and not the block. When I make multiple copies of a
> file to the store I see an increase in the deup ratio, but when I copy
> similar files the ratio stays at 1.00x.

I've done some rather intensive tests on zfs dedup on this 12TB test system we
have.  I have concluded that with some 150GB worth of L2ARC and 8GB of ARC, ZFS
dedup is unusable for volumes even at 2TB storage.  It works, but it's dead slow
in write terms, and the time to remove a dataset is still very long.  I wouldn't
recommend using ZFS dedup unless your name were Ahmed Nazif or Silvio
Berlusconi, where the damage might be used for some good.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin.  In most cases adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] multiple disk failure

2011-01-28 Thread Mike Tancsa
Hi,
I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
offsite storage.  All was working fine for about 20 minutes, and then the
new drive cage started to fail.  Silly me for assuming new hardware would
be fine :(

When the drive cage failed, it hung the server and the box rebooted.  After
it rebooted, the entire pool was gone and is in the state below.  I had only
written a few files to the new, larger pool and I am not concerned about
restoring that data.  However, is there a way to get back the original pool
data?
Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the web
page listed, BTW.


0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank1       UNAVAIL      0     0     0  insufficient replicas
  raidz1    ONLINE       0     0     0
    ad0     ONLINE       0     0     0
    ad1     ONLINE       0     0     0
    ad4     ONLINE       0     0     0
    ad6     ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    ada4    ONLINE       0     0     0
    ada5    ONLINE       0     0     0
    ada6    ONLINE       0     0     0
    ada7    ONLINE       0     0     0
  raidz1    UNAVAIL      0     0     0  insufficient replicas
    ada0    UNAVAIL      0     0     0  cannot open
    ada1    UNAVAIL      0     0     0  cannot open
    ada2    UNAVAIL      0     0     0  cannot open
    ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS root clone problem

2011-01-28 Thread alex bartonek
Hey Cindy...

wanted to post up on here since you've been helping me in email (which I 
greatly appreciate!).


I figured it out.  I've done the 'dd' thing before, etc.  I got it all the way
to where it was complaining that it cannot use an EFI-labeled drive.  When I did
a prtvtoc | fmthard on the drive, I was never able to change it to an SMI label.
So I went in there, changed the cylinder info, relabeled, changed it back,
relabeled... and voila, now I can mirror again!
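
For reference, the prtvtoc | fmthard step was roughly this (with slice 2 as
the whole-disk slice):

# prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t2d0s2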

Thank you for taking the time to personally email me with my issue.

-Alex
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss