Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Hi,

Roy Sigurd Karlsbakk wrote:
> Crucial RealSSD C300 has been released and showing good numbers for use as 
> Zil and L2ARC. Does anyone know if this unit flushes its cache on request, as 
> opposed to Intel units etc?
> 

I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did
some quick testing. Here are the numbers first, some explanation follows below:

cache enabled, 32 buffers:
Linear read, 64k blocks: 134 MB/s
random read, 64k blocks: 134 MB/s
linear read, 4k blocks: 87 MB/s
random read, 4k blocks: 87 MB/s
linear write, 64k blocks: 107 MB/s
random write, 64k blocks: 110 MB/s
linear write, 4k blocks: 76 MB/s
random write, 4k blocks: 32 MB/s

cache enabled, 1 buffer:
linear write, 4k blocks: 51 MB/s (12800 ops/s)
random write, 4k blocks: 7 MB/s (1750 ops/s)
linear write, 64k blocks: 106 MB/s (1610 ops/s)
random write, 64k blocks: 59 MB/s (920 ops/s)

cache disabled, 1 buffer:
linear write, 4k blocks: 4.2 MB/s (1050 ops/s)
random write, 4k blocks: 3.9 MB/s (980 ops/s)
linear write, 64k blocks: 40 MB/s (650 ops/s)
random write, 64k blocks: 40 MB/s (650 ops/s)

cache disabled, 32 buffers:
linear write, 4k blocks: 4.5 MB/s, 1120 ops/s
random write, 4k blocks: 4.2 MB/s, 1050 ops/s
linear write, 64k blocks: 43 MB/s, 680 ops/s
random write, 64k blocks: 44 MB/s, 690 ops/s

cache enabled, 1 buffer, with cache flushes:
linear write, 4k blocks, flush after every write: 1.5 MB/s, 385 writes/s
linear write, 4k blocks, flush after every 4th write: 4.2 MB/s, 1120 writes/s


The numbers are rough figures read quickly from iostat while the tests were
running (see the short iostat sketch below), so please don't multiply block size
by ops/s and compare it with the bandwidth given ;)
The test operates directly on top of LDI, just like ZFS.
 - "nk blocks" means the size of each read/write handed to the device driver
 - "n buffers" means the number of buffers I keep in flight; this is to keep
   the command queue of the device busy
 - "cache flush" means a synchronous DKIOCFLUSHWRITECACHE ioctl

These numbers contain a few surprises (at least for me). The biggest surprise
is that with the cache disabled one cannot get good data rates with small blocks,
even if one keeps the command queue filled. This is completely different from
what I've seen from hard drives.
Also, the IOPS with cache flushes is quite low; 385 is not much better than
a 15k hdd, and the latter scales better. On the other hand, from the large
drop in performance when using flushes one could infer that the drive does indeed
flush properly, but I haven't built a test setup for that yet.
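
Put differently, just doing the arithmetic on the number above:

   # 385 flushed writes/s works out to roughly
   #   1 / 385 = ~2.6 ms
   # per write+flush round trip -- in the same ballpark as the ~2 ms average
   # rotational latency of a 15k rpm drive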

Conclusion: From the measurements I'd infer the device makes a good L2ARC,
but for a slog device the latency is too high and it doesn't scale well.

I'll do similar tests on an Intel X25 and an OCZ Vertex 2 Pro as soon as they arrive.

If there are numbers you are missing please tell me, I'll measure them if
possible. Also please ask if there are questions regarding the test setup.

--
Arne


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Fred Liu
Looking forward to seeing your test report for the Intel X25 and OCZ Vertex 2 Pro...

Thanks.

Fred



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 23/06/2010 18:50, Adam Leventhal wrote:
>> Does it mean that for dataset used for databases and similar environments where
>> basically all blocks have fixed size and there is no other data all parity
>> information will end-up on one (z1) or two (z2) specific disks?
>
> No. There are always smaller writes to metadata that will distribute parity.
> What is the total width of your raidz1 stripe?

4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 23/06/2010 19:29, Ross Walker wrote:
> On Jun 23, 2010, at 1:48 PM, Robert Milkowski wrote:
>
>> 128GB.
>>
>> Does it mean that for dataset used for databases and similar environments where
>> basically all blocks have fixed size and there is no other data all parity
>> information will end-up on one (z1) or two (z2) specific disks?
>
> What's the record size on those datasets?
>
> 8k?

16K



Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Arne Jansen wrote:
> Hi,
> 
> Roy Sigurd Karlsbakk wrote:
>> Crucial RealSSD C300 has been released and showing good numbers for use as 
>> Zil and L2ARC. Does anyone know if this unit flushes its cache on request, 
>> as opposed to Intel units etc?
>>
> 
> I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did
> some quick testing. Here are the numbers first, some explanation follows below:

After taemun alerted me that the linear read/write numbers were too low, I found a
bottleneck: the controller connected the SSD at only 1.5 GBit. I have to check if
we can jumper it to at least 3 GBit. To connect it at 6 GBit we need some new
cables, so this might take some time.
The main purpose of this test was to evaluate the SSD with respect to its use as
a slog device, and I think the connection speed doesn't affect that. Nevertheless
I'll repeat the tests as soon as we've solved the issue.

Sorry.

--Arne



Re: [zfs-discuss] c5->c9 device name change prevents beadm activate

2010-06-24 Thread Brian Nitz

Lori,

In my case what may have caused the problem is that after a previous 
upgrade failed, I used this zfs send/recv procedure to give me (what I 
thought was) a sane rpool:


http://blogs.sun.com/migi/entry/broken_opensolaris_never

Is it possible that a zfs recv of a root pool contains the device names 
from the sending hardware?


On 06/23/10 18:15, Lori Alt wrote:

Cindy Swearingen wrote:



On 06/23/10 10:40, Evan Layton wrote:

On 6/23/10 4:29 AM, Brian Nitz wrote:

I saw a problem while upgrading from build 140 to 141 where beadm
activate {build141BE} failed because installgrub failed:

# BE_PRINT_ERR=true beadm activate opensolarismigi-4
be_do_installgrub: installgrub failed for device c5t0d0s0.
Unable to activate opensolarismigi-4.
Unknown external error.

The reason installgrub failed is that it is attempting to install grub
on c5t0d0s0 which is where my root pool is:
# zpool status
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 5h3m with 0 errors on Tue Jun 22 22:31:08 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c5t0d0s0  ONLINE       0     0     0

errors: No known data errors

But the raw device doesn't exist:
# ls -ls /dev/rdsk/c5*
/dev/rdsk/c5*: No such file or directory

Even though zfs pool still sees it as c5, the actual device seen by
format is c9t0d0s0

Is there any workaround for this problem? Is it a bug in install, zfs or
somewhere else in ON?


In this instance beadm is a victim of the zpool configuration reporting
the wrong device. This does appear to be a ZFS issue since the device
actually being used is not what zpool status is reporting. I'm forwarding
this on to the ZFS alias to see if anyone has any thoughts there.

-evan


Hi Evan,

I suspect that some kind of system, hardware, or firmware event changed
this device name. We could identify the original root pool device with
the zpool history output from this pool.

Brian, you could boot this system from the OpenSolaris LiveCD and
attempt to import this pool to see if that will update the device info
correctly.

If that doesn't help, then create /dev/rdsk/c5* symlinks to point to
the correct device.

I've seen this kind of device name change in a couple contexts now 
related to installs, image-updates, etc.


I think we need to understand why this is happening.  Prior to 
OpenSolaris and the new installer, we used to go to a fair amount of 
trouble to make sure that device names, once assigned, never changed.  
Various parts of the system depended on device names remaining the 
same across upgrades and other system events.


Does anyone know why these device names are changing?  Because that 
seems like the root of the problem.  Creating symlinks with the old 
names seems like a band-aid, which could cause problems down the 
road--what if some other device on the system gets assigned that name 
on a future update?


Lori









Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments 
>>> where basically all blocks have fixed size and there is no other data all 
>>> parity information will end-up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

From what I gather each 16KB record (plus parity) is spread across the raidz
disks. This causes the total random IOPS (write AND read) of the raidz to be
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns, not random. To get good
random IO with raidz you need a zpool with X raidz vdevs, where X = desired
IOPS / IOPS of a single drive.
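
As a rough, illustrative calculation (the per-disk IOPS figure and the device
names below are only ballpark assumptions, not measurements): if a single 7200
rpm disk delivers ~100 random IOPS, each raidz vdev delivers roughly that, so
for ~400 random IOPS you'd build something like four raidz vdevs in one pool:

   # 4x 4-disk raidz1 vdevs striped together: ~4x the random IOPS of one disk
   zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
                     raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
                     raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 \
                     raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0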

-Ross




Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Arne Jansen wrote:
> Hi,
> 
> Roy Sigurd Karlsbakk wrote:
>> Crucial RealSSD C300 has been released and showing good numbers for use as 
>> Zil and L2ARC. Does anyone know if this unit flushes its cache on request, 
>> as opposed to Intel units etc?
>>
> 
> Also the IOPS with cache flushes is quite low, 385 is not much better than
> a 15k hdd, while the latter scales better. On the other hand, from the large
> drop in performance when using flushes one could infer that they indeed flush
> properly, but I haven't built a test setup for that yet.
> 

Result from the cache flush test: while doing synchronous writes at full speed
we pulled the device from the system and compared the contents afterwards.
Result: no writes lost. We repeated the test several times.
Cross check: we also pulled the device while writing with the cache enabled, and
it lost 8 writes.

So I'd say, yes, it flushes its cache on request.

--
Arne


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread David Dyer-Bennet

On Thu, June 24, 2010 08:58, Arne Jansen wrote:

> Cross check: we pulled also while writing with cache enabled, and it lost
> 8 writes.

I'm SO pleased to see somebody paranoid enough to do that kind of
cross-check doing this benchmarking!

"Benchmarking is hard!"

> So I'd say, yes, it flushes its cache on request.

Starting to sound pretty convincing,  yes.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 14:32, Ross Walker wrote:
> On Jun 24, 2010, at 5:40 AM, Robert Milkowski wrote:
>
>> On 23/06/2010 18:50, Adam Leventhal wrote:
>>>> Does it mean that for dataset used for databases and similar environments where
>>>> basically all blocks have fixed size and there is no other data all parity
>>>> information will end-up on one (z1) or two (z2) specific disks?
>>>
>>> No. There are always smaller writes to metadata that will distribute parity.
>>> What is the total width of your raidz1 stripe?
>>
>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>
> From what I gather each 16KB record (plus parity) is spread across the raidz
> disks. This causes the total random IOPS (write AND read) of the raidz to be
> that of the slowest disk in the raidz.
>
> Raidz is definitely made for sequential IO patterns not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs where X = desired
> IOPS/IOPS of single drive.

I know that, and it wasn't my question.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Bob Friesenhahn

On Thu, 24 Jun 2010, Ross Walker wrote:

> Raidz is definitely made for sequential IO patterns not random. To
> get good random IO with raidz you need a zpool with X raidz vdevs
> where X = desired IOPS/IOPS of single drive.

Remarkably, I have yet to see mention of someone testing a raidz which 
is comprised entirely of FLASH SSDs.  This should help with the IOPS, 
particularly when reading.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


[zfs-discuss] zfs failsafe pool mismatch

2010-06-24 Thread Shawn Belaire
I have a customer that described this issue to me in general terms.

I'd like to know how to replicate it, and what the best practice is to avoid
the issue, or fix it in an accepted manner.

If they apply a kernel patch and reboot, they may get messages informing them that
the pool version is down-rev'd.  If they act on the message and upgrade the pool
version and then have to boot from the failsafe archive, the boot fails, as that
kernel does not support the new pool version.

What would be a way to fix this, and should we allow this catch to even happen?

Thanks
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 15:54, Bob Friesenhahn wrote:
> On Thu, 24 Jun 2010, Ross Walker wrote:
>
>> Raidz is definitely made for sequential IO patterns not random. To
>> get good random IO with raidz you need a zpool with X raidz vdevs
>> where X = desired IOPS/IOPS of single drive.
>
> Remarkably, I have yet to see mention of someone testing a raidz which
> is comprised entirely of FLASH SSDs.  This should help with the IOPS,
> particularly when reading.

I have.

Briefly:


  X4270 2x Quad-core 2.93GHz, 72GB RAM
  Open Solaris 2009.06 (snv_111b)
  ARC limited to 4GB
  44x SSD in a F5100.
  4x SAS HBAs, 4x physical SAS connections to the f5100 (16x SAS 
channels in total), each to a different domain.



1. RAID-10 pool

22x mirrors across domains
ZFS: 16KB recordsize, atime=off
random-read filebench benchmark with a 16KB block size and 1, 16,
..., 128 threads, 128GB working set.


maximum performance when 128 threads: ~137,000 ops/s

2. RAID-Z pool

11x 4-way RAID-z, each raid-z vdev across domains
ZFS: recordsize=16k, atime=off
random-read filebench benchmark with a 16KB block size and 1, 16,
..., 128 threads, 128GB working set.


maximum performance when 64-128 threads: ~34,000 ops/s

With a ZFS recordsize of 32KB it got up to ~41,000 ops/s.
Larger ZFS record sizes produced worse results.
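
The two pool layouts were created roughly like this (device names are made up
and only a few of the 44 SSDs are shown, just to illustrate the two shapes):

   # RAID-10: striped two-way mirrors, each mirror across domains
   zpool create r10pool mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 \
                        mirror c1t2d0 c2t2d0 mirror c1t3d0 c2t3d0
   zfs set recordsize=16k r10pool
   zfs set atime=off r10pool

   # RAID-Z: striped 4-way raidz1 vdevs, each vdev across domains
   zpool create rzpool raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 \
                       raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0
   zfs set recordsize=16k rzpool
   zfs set atime=off rzpool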



RAID-Z delivered about 3.3x fewer ops/s compared to RAID-10 here.
SSDs do not make any fundamental change here, and the RAID-Z characteristics
are basically the same whether it is built from SSDs or HDDs.

However, SSDs could of course provide good-enough performance even with
RAID-Z; at the end of the day it is not about benchmarks but about your
environment's requirements.

A given number of SSDs in a RAID-Z configuration is able to deliver the
same performance as a much greater number of disk drives in a RAID-10
configuration, and if you don't need much space it could make sense.



--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.

2010-06-24 Thread Eric Jones
Where is the link to the script, and does it work with RAIDZ arrays?  Thanks so 
much.
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] zfs failsafe pool mismatch

2010-06-24 Thread Cindy Swearingen

Hi Shawn,

I think this can happen if you apply patch 141445-09.
It should not happen in the future.

I believe the workaround is this:

1. Boot the system from the correct media.

2. Install the boot blocks on the root pool disk(s).

3. Upgrade the pool.
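
On an x86 system that would look something like this (the disk name is only a
placeholder for your root pool disk):

   # booted from the matching media or the newly patched BE:
   installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0
   zpool upgrade rpool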

Thanks,

Cindy



[zfs-discuss] ZFS Filesystem Recovery on RAIDZ Array

2010-06-24 Thread Eric Jones
This day went from a usual Thursday to the worst day of my life in the span of
about 10 seconds.  Here's the scenario:

2 computers, both Solaris 10u8, one the primary, one the backup.  The Primary
system is RAIDZ2, the Backup is RAIDZ with 4 drives.  Every night, Primary mirrors
to Backup using the 'zfs send' command. The Backup receives that with 'zfs recv
-vFd'.  This ensures that both machines have an identical set of
filesystems/snapshots every night.  (Snapshots are taken on Primary every hour
during the workday.)
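
The nightly job is basically along these lines (pool and host names are
simplified here, and the recursive send is my shorthand for however the full
set of filesystems actually gets sent):

   # on Primary, after taking tonight's recursive snapshot:
   zfs snapshot -r tank@tonight
   zfs send -R -i tank@lastnight tank@tonight | \
           ssh backup zfs recv -vFd backuppool
   # note: when receiving a replication stream, 'recv -F' also destroys
   # snapshots and filesystems on the receiver that no longer exist on the
   # sender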

The issue began Monday when Primary failed.  After restoring it to operating
condition I began restoring the filesystems from Backup, again using ZFS
send/recv.  By midnight, only about half of the data had been restored, at which
point Primary attempted its regularly scheduled mirror operation with Backup.
One of our primary ZFS filesystems had not yet been restored, and since it
wasn't on Primary when the mirror operation began, 'zfs recv' destroyed it on
the Backup system.  AH.

So, in short, a RAIDZ array contained 7 ZFS filesystems + dozens of snapshots 
in one RAIDZ pool.  12 hours ago some of those filesystems were destroyed, 
effectively by a zfs destroy command (executed by zfs recv).  No data has been 
written to that pool since then. Is there any way to revert it to the state it
was in 12 hours ago?
-- 
This message posted from opensolaris.org


[zfs-discuss] Gonna be stupid here...

2010-06-24 Thread Erik Trimble

But it's early (for me), and I can't remember the answer here.

I'm sizing an Oracle database appliance.  I'd like to get one of the
F20 96GB flash accelerators to play with, but I can't imagine I'd be
using the whole thing for ZIL.  The DB is likely to be a couple TB in size.


Couple of questions:

(a) since everything is going to be zvols, and I'm going to be doing 
lots of sync writes to them, I'm thinking that allocating around a dozen 
GB of the F20's flash would be useful.  :-)


(b) can zvols still make use of an L2ARC device for their pool?  I'm
assuming so, since it's both block and metadata that get stored there.
I'm considering adding a couple of very large SSDs so I might be able to
cache most of my DB in the L2ARC, if that works.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 10:42 AM, Robert Milkowski  wrote:

> On 24/06/2010 14:32, Ross Walker wrote:
>> On Jun 24, 2010, at 5:40 AM, Robert Milkowski wrote:
>>
>>> On 23/06/2010 18:50, Adam Leventhal wrote:
>>>>> Does it mean that for dataset used for databases and similar environments where
>>>>> basically all blocks have fixed size and there is no other data all parity
>>>>> information will end-up on one (z1) or two (z2) specific disks?
>>>>
>>>> No. There are always smaller writes to metadata that will distribute
>>>> parity. What is the total width of your raidz1 stripe?
>>>
>>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>>
>> From what I gather each 16KB record (plus parity) is spread across the raidz
>> disks. This causes the total random IOPS (write AND read) of the raidz to be
>> that of the slowest disk in the raidz.
>>
>> Raidz is definitely made for sequential IO patterns not random. To get good
>> random IO with raidz you need a zpool with X raidz vdevs where X = desired
>> IOPS/IOPS of single drive.
>
> I know that, and it wasn't my question.

Sorry, that was meant for the OP...




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Adam Leventhal
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and 
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments 
>>> where basically all blocks have fixed size and there is no other data all 
>>> parity information will end-up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 
> 


--
Adam Leventhal, Fishworks              http://blogs.sun.com/ahl



Re: [zfs-discuss] Gonna be stupid here...

2010-06-24 Thread Darren J Moffat

On 24/06/2010 17:49, Erik Trimble wrote:

But it's early (for me), and I can't remember the answer here.

I'm sizing an Oracle database appliance. I'd like to get one of the F20
96GB flash accellerators to play with, but I can't imagine I'd be using
the whole thing for ZIL. The DB is likely to be a couple TB in size.

Couple of questions:

(a) since everything is going to be zvols, and I'm going to be doing
lots of sync writes to them, I'm thinking that allocating around a dozen
GB of the F20's flash would be useful. :-)

(b) can zvols still make use of an L2ARC device for their pool? I'm
assuming so, since it's both block and metadata that get stored there.
I'm considering adding a couple of very large SSDs to I might be able to
cache most of my DB in the L2ARC, if that works.


Yes, the level that the L2ARC works at doesn't care if the dataset is 
filesystem or ZVOL.
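
So for the setup you describe, something along these lines would do (device
names are placeholders; the log device is whatever ~12GB slice you carve out
of the F20):

   # a ~12GB F20 slice as slog, two large SSDs as L2ARC for the same pool
   zpool add dbpool log c3t0d0s0
   zpool add dbpool cache c4t0d0 c5t0d0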


--
Darren J Moffat


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Arne Jansen

Ross Walker wrote:

> Raidz is definitely made for sequential IO patterns not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs where X = desired
> IOPS/IOPS of single drive.

I have seen statements like this repeated several times, though
I haven't been able to find an in-depth discussion of why this
is the case. From what I've gathered every block (what is the
correct term for this? zio block?) written is spread across the
whole raid-z. But in what units? Will a 4k write be split into
512 byte writes? And in the opposite direction, every block needs
to be read fully, even if only parts of it are being requested,
because the checksum needs to be checked? Will the parity be
read, too?
If this is all the case, I can see why raid-z reduces the performance
of an array effectively to one device w.r.t. random reads.

Thanks,
Arne


Re: [zfs-discuss] c5->c9 device name change prevents beadm activate

2010-06-24 Thread Lori Alt

On 06/24/10 03:27 AM, Brian Nitz wrote:

Lori,

In my case what may have caused the problem is that after a previous 
upgrade failed, I used this zfs send/recv procedure to give me (what I 
thought was) a sane rpool:


http://blogs.sun.com/migi/entry/broken_opensolaris_never

Is it possible that a zfs recv of a root pool contains the device 
names from the sending hardware?


Yes, the data installed by the zfs recv will contain the device names 
from the sending hardware.


I looked at the instructions in the blog you reference above and while 
the procedure *might* work in some circumstances, it would mostly be by 
accident.  Maybe if there is an exact match of hardware, it might work, 
but there's also metadata that describes the BEs on a system and I doubt 
whether the send/recv would restore all the information necessary to do 
that.


You might want to bring this subject up on the 
caiman-disc...@opensolaris.org alias, where needs like this can be 
addressed for real, in the supported installation tools.


Lori







Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 20:52, Arne Jansen wrote:
> Ross Walker wrote:
>> Raidz is definitely made for sequential IO patterns not random. To
>> get good random IO with raidz you need a zpool with X raidz vdevs
>> where X = desired IOPS/IOPS of single drive.
>
> I have seen statements like this repeated several times, though
> I haven't been able to find an in-depth discussion of why this
> is the case. From what I've gathered every block (what is the
> correct term for this? zio block?) written is spread across the
> whole raid-z. But in what units? Will a 4k write be split into
> 512 byte writes? And in the opposite direction, every block needs
> to be read fully, even if only parts of it are being requested,
> because the checksum needs to be checked? Will the parity be
> read, too?
> If this is all the case, I can see why raid-z reduces the performance
> of an array effectively to one device w.r.t. random reads.


http://blogs.sun.com/roch/entry/when_to_and_not_to

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] One dataset per user?

2010-06-24 Thread Paul B. Henson
On Tue, 22 Jun 2010, Arne Jansen wrote:

> We found that the zfs utility is very inefficient as it does a lot of
> unnecessary and costly checks.

Hmm, presumably somebody at Sun doesn't agree with that assessment or you'd
think they'd take them out :).

Mounting/sharing by hand outside of the zfs framework does make a huge
difference. It takes about 45 minutes to mount/share or unshare/unmount
with the mountpoint and sharenfs zfs properties set; mounting/sharing by
hand with SHARE_NOINUSE_CHECK=1, even just sequentially, only took about 2
minutes. With some parallelization I could definitely see hitting that 10
seconds you mentioned, which would sure make my patch windows a hell of a
lot shorter. I'll need to put together a script and fiddle some with SMF, joy
oh joy; I need these filesystems mounted before the web server starts.
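
Roughly what I have in mind (a ksh sketch; the pool name is made up and
there's no parallelization yet):

   # mount and share every filesystem in the pool by hand, bypassing the
   # zfs mount/share code paths
   zfs list -H -o name,mountpoint -r tank | while read fs mp; do
           [ "$mp" = "legacy" -o "$mp" = "none" ] && continue
           mount -F zfs "$fs" "$mp"
           SHARE_NOINUSE_CHECK=1 share -F nfs -o rw "$mp"
   done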

Thanks much for the tip!

I'm hoping someday they'll clean up the sharing implementation and make it
a bit more scalable. I had a ticket open once and they pretty much said it
would never happen for Solaris 10, but maybe sometime in the indefinite
future for OpenSolaris...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768