Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Hi,

Roy Sigurd Karlsbakk wrote:
> Crucial RealSSD C300 has been released and showing good numbers for use as
> Zil and L2ARC. Does anyone know if this unit flushes its cache on request,
> as opposed to Intel units etc?

I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and
did some quick testing. Here are the numbers first, some explanation follows
below:

cache enabled, 32 buffers:
  linear read, 64k blocks:   134 MB/s
  random read, 64k blocks:   134 MB/s
  linear read, 4k blocks:     87 MB/s
  random read, 4k blocks:     87 MB/s
  linear write, 64k blocks:  107 MB/s
  random write, 64k blocks:  110 MB/s
  linear write, 4k blocks:    76 MB/s
  random write, 4k blocks:    32 MB/s

cache enabled, 1 buffer:
  linear write, 4k blocks:    51 MB/s (12800 ops/s)
  random write, 4k blocks:     7 MB/s  (1750 ops/s)
  linear write, 64k blocks:  106 MB/s  (1610 ops/s)
  random write, 64k blocks:   59 MB/s   (920 ops/s)

cache disabled, 1 buffer:
  linear write, 4k blocks:   4.2 MB/s  (1050 ops/s)
  random write, 4k blocks:   3.9 MB/s   (980 ops/s)
  linear write, 64k blocks:   40 MB/s   (650 ops/s)
  random write, 64k blocks:   40 MB/s   (650 ops/s)

cache disabled, 32 buffers:
  linear write, 4k blocks:   4.5 MB/s, 1120 ops/s
  random write, 4k blocks:   4.2 MB/s, 1050 ops/s
  linear write, 64k blocks:   43 MB/s,  680 ops/s
  random write, 64k blocks:   44 MB/s,  690 ops/s

cache enabled, 1 buffer, with cache flushes:
  linear write, 4k blocks, flush after every write:      1.5 MB/s,  385 writes/s
  linear write, 4k blocks, flush after every 4th write:  4.2 MB/s, 1120 writes/s

The numbers are rough numbers read quickly from iostat, so please don't
multiply block size by ops and compare with the bandwidth given ;)
The test operates directly on top of LDI, just like ZFS.
- "nk blocks" means the size of each read/write given to the device driver
- "n buffers" means the number of buffers I keep in flight. This is to keep
  the command queue of the device busy
- "cache flush" means a synchronous ioctl DKIOCFLUSHWRITECACHE

These numbers contain a few surprises (at least for me). The biggest surprise
is that with the cache disabled one cannot get good data rates with small
blocks, even if one keeps the command queue filled. This is completely
different from what I've seen from hard drives.
Also, the IOPS with cache flushes are quite low; 385 is not much better than
a 15k hdd, while the latter scales better. On the other hand, from the large
drop in performance when using flushes one could infer that it does indeed
flush properly, but I haven't built a test setup for that yet.

Conclusion: from the measurements I'd infer the device makes a good L2ARC,
but for a slog device the latency is too high and it doesn't scale well.

I'll do similar tests on an Intel X25 and an OCZ Vertex 2 Pro as soon as they
arrive.

If there are numbers you are missing, please tell me; I'll measure them if
possible. Also please ask if there are questions regarding the test setup.

-- Arne
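(For anyone who wants to poke at this from userland: below is a rough sketch
of the "flush after every write" case. It is not the LDI-based harness used
for the numbers above, just a user-level approximation; the device path and
write count are placeholders, and writing to a raw device destroys whatever
is on it, so only point it at a scratch disk.)

/* Sketch: write 4k blocks to a raw device and request a synchronous
 * cache flush (DKIOCFLUSHWRITECACHE) after each one.  Solaris only;
 * device path and loop count are placeholders, not test values. */
#include <sys/types.h>
#include <sys/dkio.h>          /* DKIOCFLUSHWRITECACHE */
#include <stropts.h>           /* ioctl() */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
        const char *dev = "/dev/rdsk/c2t0d0s0";  /* placeholder scratch device */
        char *buf;
        int fd, i;

        buf = memalign(512, 4096);      /* raw I/O wants sector alignment */
        if (buf == NULL) {
                perror("memalign");
                return (1);
        }
        (void) memset(buf, 0xab, 4096);

        fd = open(dev, O_RDWR);
        if (fd < 0) {
                perror("open");
                return (1);
        }

        for (i = 0; i < 10000; i++) {
                if (pwrite(fd, buf, 4096, (off_t)i * 4096) != 4096) {
                        perror("pwrite");
                        return (1);
                }
                /* synchronous cache flush, the same ioctl request as above */
                if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) != 0) {
                        perror("DKIOCFLUSHWRITECACHE");
                        return (1);
                }
        }
        (void) close(fd);
        return (0);
}

Watching the device with iostat while something like this runs is how rough
writes/s figures such as the ones above are read off.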
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Looking forward to seeing your test report from the Intel X25 and OCZ Vertex
2 Pro... Thanks.

Fred

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Arne Jansen
Sent: Thursday, June 24, 2010 16:15
To: Roy Sigurd Karlsbakk
Cc: OpenSolaris ZFS discuss
Subject: Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

[...]
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 18:50, Adam Leventhal wrote:
>> Does it mean that for dataset used for databases and similar environments
>> where basically all blocks have fixed size and there is no other data all
>> parity information will end up on one (z1) or two (z2) specific disks?
>
> No. There are always smaller writes to metadata that will distribute
> parity. What is the total width of your raidz1 stripe?

4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

-- 
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 19:29, Ross Walker wrote:
> On Jun 23, 2010, at 1:48 PM, Robert Milkowski wrote:
>> 128GB.
>>
>> Does it mean that for dataset used for databases and similar environments
>> where basically all blocks have fixed size and there is no other data all
>> parity information will end up on one (z1) or two (z2) specific disks?
>
> What's the record size on those datasets? 8k?

16K
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Arne Jansen wrote:
> Hi,
>
> Roy Sigurd Karlsbakk wrote:
>> Crucial RealSSD C300 has been released and showing good numbers for use as
>> Zil and L2ARC. Does anyone know if this unit flushes its cache on request,
>> as opposed to Intel units etc?
>
> I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday
> and did some quick testing. Here are the numbers first, some explanation
> follows below:

After taemun alerted me that the linear read/write numbers are too low, I
found a bottleneck: the controller decided to connect the SSD at only
1.5 GBit. I have to check if we can jumper it to at least 3 GBit. To connect
it at 6 GBit we need some new cables, so this might take some time.

The main purpose of this test was to evaluate the SSD with respect to usage
as a slog device, and I think the connection speed doesn't affect this.
Nevertheless, I'll repeat the tests as soon as we have solved the issues.

Sorry.

--Arne

> [...]
Re: [zfs-discuss] c5->c9 device name change prevents beadm activate
Lori,

In my case what may have caused the problem is that after a previous upgrade
failed, I used this zfs send/recv procedure to give me (what I thought was) a
sane rpool:

http://blogs.sun.com/migi/entry/broken_opensolaris_never

Is it possible that a zfs recv of a root pool contains the device names from
the sending hardware?

On 06/23/10 18:15, Lori Alt wrote:
> Cindy Swearingen wrote:
>> On 06/23/10 10:40, Evan Layton wrote:
>>> On 6/23/10 4:29 AM, Brian Nitz wrote:
>>>> I saw a problem while upgrading from build 140 to 141 where beadm
>>>> activate {build141BE} failed because installgrub failed:
>>>>
>>>>   # BE_PRINT_ERR=true beadm activate opensolarismigi-4
>>>>   be_do_installgrub: installgrub failed for device c5t0d0s0.
>>>>   Unable to activate opensolarismigi-4.
>>>>   Unknown external error.
>>>>
>>>> The reason installgrub failed is that it is attempting to install grub
>>>> on c5t0d0s0, which is where my root pool is:
>>>>
>>>>   # zpool status
>>>>     pool: rpool
>>>>    state: ONLINE
>>>>   status: The pool is formatted using an older on-disk format. The pool
>>>>           can still be used, but some features are unavailable.
>>>>   action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
>>>>           pool will no longer be accessible on older software versions.
>>>>     scan: scrub repaired 0 in 5h3m with 0 errors on Tue Jun 22 22:31:08 2010
>>>>   config:
>>>>
>>>>           NAME        STATE     READ WRITE CKSUM
>>>>           rpool       ONLINE       0     0     0
>>>>             c5t0d0s0  ONLINE       0     0     0
>>>>
>>>>   errors: No known data errors
>>>>
>>>> But the raw device doesn't exist:
>>>>
>>>>   # ls -ls /dev/rdsk/c5*
>>>>   /dev/rdsk/c5*: No such file or directory
>>>>
>>>> Even though the zfs pool still sees it as c5, the actual device seen by
>>>> format is c9t0d0s0.
>>>>
>>>> Is there any workaround for this problem? Is it a bug in install, zfs or
>>>> somewhere else in ON?
>>>
>>> In this instance beadm is a victim of the zpool configuration reporting
>>> the wrong device. This does appear to be a ZFS issue since the device
>>> actually being used is not what zpool status is reporting. I'm forwarding
>>> this on to the ZFS alias to see if anyone has any thoughts there.
>>> -evan
>>
>> Hi Evan,
>>
>> I suspect that some kind of system, hardware, or firmware event changed
>> this device name. We could identify the original root pool device with
>> the zpool history output from this pool.
>>
>> Brian, you could boot this system from the OpenSolaris LiveCD and attempt
>> to import this pool to see if that will update the device info correctly.
>> If that doesn't help, then create /dev/rdsk/c5* symlinks to point to the
>> correct device.
>
> I've seen this kind of device name change in a couple contexts now related
> to installs, image-updates, etc. I think we need to understand why this is
> happening. Prior to OpenSolaris and the new installer, we used to go to a
> fair amount of trouble to make sure that device names, once assigned, never
> changed. Various parts of the system depended on device names remaining the
> same across upgrades and other system events.
>
> Does anyone know why these device names are changing? Because that seems
> like the root of the problem. Creating symlinks with the old names seems
> like a band-aid, which could cause problems down the road--what if some
> other device on the system gets assigned that name on a future update?
>
> Lori
Re: [zfs-discuss] raid-z - not even iops distribution
On Jun 24, 2010, at 5:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments
>>> where basically all blocks have fixed size and there is no other data all
>>> parity information will end up on one (z1) or two (z2) specific disks?
>>
>> No. There are always smaller writes to metadata that will distribute
>> parity. What is the total width of your raidz1 stripe?
>
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

From what I gather, each 16KB record (plus parity) is spread across the raidz
disks. This causes the total random IOPS (write AND read) of the raidz to be
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns, not random. To get good
random IO with raidz you need a zpool with X raidz vdevs, where
X = desired IOPS / IOPS of a single drive.

-Ross
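(As a rough illustration of that rule of thumb -- the per-drive figure is an
assumed round number, not a measurement from this thread: with disks doing on
the order of 100 random IOPS each, a target of 2,000 random-read IOPS needs
about X = 2000 / 100 = 20 raidz vdevs, independent of how wide each vdev is,
since a random read of a single block occupies every data disk in its vdev.)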
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Arne Jansen wrote:
> Hi,
>
> Roy Sigurd Karlsbakk wrote:
>> Crucial RealSSD C300 has been released and showing good numbers for use as
>> Zil and L2ARC. Does anyone know if this unit flushes its cache on request,
>> as opposed to Intel units etc?
>
> Also, the IOPS with cache flushes are quite low; 385 is not much better
> than a 15k hdd, while the latter scales better. On the other hand, from the
> large drop in performance when using flushes one could infer that it does
> indeed flush properly, but I haven't built a test setup for that yet.

Result from the cache flush test: while doing synchronous writes at full
speed we pulled the device from the system and compared the contents
afterwards. Result: no writes lost. We repeated the test several times.

Cross check: we also pulled while writing with the cache enabled, and it
lost 8 writes.

So I'd say, yes, it flushes its cache on request.

-- Arne
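(To make the methodology concrete: one common way to structure such a
pull-the-plug check -- not necessarily Arne's harness, and the device path
below is a placeholder -- is to have the writer stamp each 4k block with a
sequence number, flush, and record the last sequence number acknowledged
before the pull. A verify pass like the sketch below then re-reads the device
and counts acknowledged stamps that never made it to flash, which is the kind
of count behind a result like "lost 8 writes".)

/* Verify pass for a pull-the-plug test.  Assumes the writer stamped the
 * first 8 bytes of every 4k block with its sequence number before writing
 * and flushing it.  Device path is a placeholder. */
#include <sys/types.h>
#include <inttypes.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLKSZ   4096

int
main(int argc, char **argv)
{
        const char *dev = "/dev/rdsk/c2t0d0s0";  /* placeholder */
        uint64_t acked, seq, lost = 0;
        char *buf;
        int fd;

        if (argc < 2) {
                (void) fprintf(stderr, "usage: verify <last-acked-seq>\n");
                return (1);
        }
        acked = strtoull(argv[1], NULL, 10);    /* last acknowledged write */

        buf = memalign(512, BLKSZ);
        fd = open(dev, O_RDONLY);
        if (buf == NULL || fd < 0) {
                perror("setup");
                return (1);
        }

        for (seq = 0; seq <= acked; seq++) {
                if (pread(fd, buf, BLKSZ, (off_t)(seq * BLKSZ)) != BLKSZ) {
                        perror("pread");
                        return (1);
                }
                /* acknowledged as flushed, but the stamp is missing */
                if (memcmp(buf, &seq, sizeof (seq)) != 0)
                        lost++;
        }
        (void) printf("%llu acknowledged writes lost\n",
            (unsigned long long)lost);
        (void) close(fd);
        return (0);
}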
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
On Thu, June 24, 2010 08:58, Arne Jansen wrote:

> Cross check: we also pulled while writing with the cache enabled, and it
> lost 8 writes.

I'm SO pleased to see somebody paranoid enough to do that kind of cross-check
doing this benchmarking!  "Benchmarking is hard!"

> So I'd say, yes, it flushes its cache on request.

Starting to sound pretty convincing, yes.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 14:32, Ross Walker wrote:
> On Jun 24, 2010, at 5:40 AM, Robert Milkowski wrote:
>> [...]
>>
>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>
> From what I gather, each 16KB record (plus parity) is spread across the
> raidz disks. This causes the total random IOPS (write AND read) of the
> raidz to be that of the slowest disk in the raidz.
>
> Raidz is definitely made for sequential IO patterns, not random. To get
> good random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I know that, and that wasn't my question.

-- 
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] raid-z - not even iops distribution
On Thu, 24 Jun 2010, Ross Walker wrote:

> Raidz is definitely made for sequential IO patterns, not random. To get
> good random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

Remarkably, I have yet to see mention of someone testing a raidz which is
comprised entirely of FLASH SSDs. This should help with the IOPS,
particularly when reading.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
[zfs-discuss] zfs failsafe pool mismatch
I have a customer that described this issue to me in general terms. I'd like
to know how to replicate it, and what the best practice is to avoid the
issue, or fix it in an accepted manner.

If they apply a kernel patch and reboot, they may get messages informing them
that the pool version is down-rev'd. If they act on the message and upgrade
the pool version, and then have to boot from the failsafe archive, it fails
because that kernel does not support that pool version.

What would be a way to fix this, and should this situation even be allowed to
happen?

Thanks
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 15:54, Bob Friesenhahn wrote:
> On Thu, 24 Jun 2010, Ross Walker wrote:
>> Raidz is definitely made for sequential IO patterns, not random. To get
>> good random IO with raidz you need a zpool with X raidz vdevs, where
>> X = desired IOPS / IOPS of a single drive.
>
> Remarkably, I have yet to see mention of someone testing a raidz which is
> comprised entirely of FLASH SSDs. This should help with the IOPS,
> particularly when reading.

I have. Briefly:

X4270, 2x quad-core 2.93 GHz, 72GB RAM
OpenSolaris 2009.06 (snv_111b), ARC limited to 4GB
44x SSD in an F5100; 4x SAS HBAs, 4x physical SAS connections to the F5100
(16x SAS channels in total), each to a different domain.

1. RAID-10 pool: 22x mirrors across domains
   ZFS: 16KB recordsize, atime=off
   randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128
   threads, 128GB working set.
   Maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool: 11x 4-way RAID-Z, each raid-z vdev across domains
   ZFS: recordsize=16k, atime=off
   randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128
   threads, 128GB working set.
   Maximum performance at 64-128 threads: ~34,000 ops/s
   With a ZFS recordsize of 32KB it got up to ~41,000 ops/s. Larger ZFS
   record sizes produced worse results.

RAID-Z delivered about 3.3x fewer ops/s compared to RAID-10 here. SSDs do not
make any fundamental change here, and RAID-Z characteristics are basically
the same whether it is configured out of SSDs or HDDs.

However, SSDs could of course provide good-enough performance even with
RAID-Z, as at the end of the day it is not about benchmarks but your
environment requirements. A given number of SSDs in a RAID-Z configuration is
able to deliver the same performance as a much greater number of disk drives
in a RAID-10 configuration, and if you don't need much space it could make
sense.

-- 
Robert Milkowski
http://milek.blogspot.com
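(For reference, the 3.3x figure follows from the best results above:
137,000 / ~41,000 ops/s for RAID-Z at a 32KB recordsize is about 3.3; against
the 16KB-recordsize result it is 137,000 / 34,000, about 4.0.)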
Re: [zfs-discuss] ZFS forensics/revert/restore shellscript and how-to.
Where is the link to the script, and does it work with RAIDZ arrays?

Thanks so much.
Re: [zfs-discuss] zfs failsafe pool mismatch
Hi Shawn,

I think this can happen if you apply patch 141445-09. It should not happen in
the future.

I believe the workaround is this:

1. Boot the system from the correct media.
2. Install the boot blocks on the root pool disk(s).
3. Upgrade the pool.

Thanks,

Cindy

On 06/24/10 09:24, Shawn Belaire wrote:
> I have a customer that described this issue to me in general terms. I'd
> like to know how to replicate it, and what the best practice is to avoid
> the issue, or fix it in an accepted manner.
>
> If they apply a kernel patch and reboot, they may get messages informing
> them that the pool version is down-rev'd. If they act on the message and
> upgrade the pool version, and then have to boot from the failsafe archive,
> it fails because that kernel does not support that pool version.
>
> What would be a way to fix this, and should this situation even be allowed
> to happen?
>
> Thanks
[zfs-discuss] ZFS Filesystem Recovery on RAIDZ Array
This day went from a usual Thursday to the worst day of my life in the span
of about 10 seconds. Here's the scenario:

Two computers, both Solaris 10u8; one is the primary, one is the backup. The
Primary system is RAIDZ2, the Backup is RAIDZ with 4 drives. Every night,
Primary mirrors to Backup using the 'zfs send' command. The Backup receives
that with 'zfs recv -vFd'. This ensures that both machines have an identical
set of filesystems/snapshots every night. (Snapshots are taken on Primary
every hour during the workday.)

The issue began Monday when Primary failed. After restoring it to operating
condition I began restoring the filesystems from Backup, again using ZFS
send/recv. By midnight, only about half of the data had been recovered, at
which point Primary attempted its regularly scheduled mirror operation with
Backup. One of our primary ZFS filesystems had not yet been restored, and
since it wasn't on Primary when the mirror operation began, 'zfs recv'
destroyed it on the Backup system. AH.

So, in short: a RAIDZ array contained 7 ZFS filesystems + dozens of snapshots
in one RAIDZ pool. 12 hours ago some of those filesystems were destroyed,
effectively by a zfs destroy command (executed by zfs recv). No data has been
written to that pool since then.

Is there any way to revert it to the state it was in 12 hours ago?
[zfs-discuss] Gonna be stupid here...
But it's early (for me), and I can't remember the answer here.

I'm sizing an Oracle database appliance. I'd like to get one of the F20 96GB
flash accelerators to play with, but I can't imagine I'd be using the whole
thing for ZIL. The DB is likely to be a couple TB in size.

Couple of questions:

(a) Since everything is going to be zvols, and I'm going to be doing lots of
sync writes to them, I'm thinking that allocating around a dozen GB of the
F20's flash would be useful. :-)

(b) Can zvols still make use of an L2ARC device for their pool? I'm assuming
so, since it's both block and metadata that get stored there. I'm considering
adding a couple of very large SSDs so I might be able to cache most of my DB
in the L2ARC, if that works.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
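(A back-of-the-envelope for (a), using assumed numbers that are not from this
thread: a slog only has to hold synchronous writes until the next transaction
group commits them to the main pool, so if the box sinks on the order of
500 MB/s of sync writes and a txg is pushed out every few seconds -- call it
10 s to be generous -- then roughly 500 MB/s x 10 s = 5 GB is outstanding at
worst, and a dozen GB of the F20 leaves comfortable headroom.)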
Re: [zfs-discuss] raid-z - not even iops distribution
On Jun 24, 2010, at 10:42 AM, Robert Milkowski wrote:

> On 24/06/2010 14:32, Ross Walker wrote:
>> On Jun 24, 2010, at 5:40 AM, Robert Milkowski wrote:
>>> [...]
>>>
>>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>>
>> From what I gather, each 16KB record (plus parity) is spread across the
>> raidz disks. This causes the total random IOPS (write AND read) of the
>> raidz to be that of the slowest disk in the raidz.
>>
>> Raidz is definitely made for sequential IO patterns, not random. To get
>> good random IO with raidz you need a zpool with X raidz vdevs, where
>> X = desired IOPS / IOPS of a single drive.
>
> I know that, and that wasn't my question.

Sorry, for the OP...
Re: [zfs-discuss] raid-z - not even iops distribution
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments
>>> where basically all blocks have fixed size and there is no other data all
>>> parity information will end up on one (z1) or two (z2) specific disks?
>>
>> No. There are always smaller writes to metadata that will distribute
>> parity. What is the total width of your raidz1 stripe?
>
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>
> --
> Robert Milkowski
> http://milek.blogspot.com

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] Gonna be stupid here...
On 24/06/2010 17:49, Erik Trimble wrote:
> But it's early (for me), and I can't remember the answer here.
>
> I'm sizing an Oracle database appliance. I'd like to get one of the F20
> 96GB flash accelerators to play with, but I can't imagine I'd be using the
> whole thing for ZIL. The DB is likely to be a couple TB in size.
>
> Couple of questions:
>
> (a) Since everything is going to be zvols, and I'm going to be doing lots
> of sync writes to them, I'm thinking that allocating around a dozen GB of
> the F20's flash would be useful. :-)
>
> (b) Can zvols still make use of an L2ARC device for their pool? I'm
> assuming so, since it's both block and metadata that get stored there. I'm
> considering adding a couple of very large SSDs so I might be able to cache
> most of my DB in the L2ARC, if that works.

Yes, the level that the L2ARC works at doesn't care if the dataset is a
filesystem or a ZVOL.

-- 
Darren J Moffat
Re: [zfs-discuss] raid-z - not even iops distribution
Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get
> good random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I have seen statements like this repeated several times, though I haven't
been able to find an in-depth discussion of why this is the case.

From what I've gathered, every block (what is the correct term for this? zio
block?) written is spread across the whole raid-z. But in what units? Will a
4k write be split into 512-byte writes? And in the opposite direction, does
every block need to be read fully, even if only parts of it are being
requested, because the checksum needs to be checked? Will the parity be read,
too?

If this is all the case, I can see why raid-z reduces the performance of an
array effectively to one device w.r.t. random reads.

Thanks,
Arne
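(A rough worked example of the split being asked about, assuming 512-byte
sectors and the 4-disk raidz1 / 16KB recordsize from earlier in the thread: a
16KB block is 32 data sectors; with one column's worth of parity that works
out to roughly 11 + 11 + 10 data sectors on three disks plus about 11 parity
sectors on the fourth. A single random read of that block therefore touches
every data column of the vdev at once, which is why the vdev behaves like a
single disk for random reads.)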
Re: [zfs-discuss] c5->c9 device name change prevents beadm activate
On 06/24/10 03:27 AM, Brian Nitz wrote:
> Lori,
>
> In my case what may have caused the problem is that after a previous
> upgrade failed, I used this zfs send/recv procedure to give me (what I
> thought was) a sane rpool:
>
> http://blogs.sun.com/migi/entry/broken_opensolaris_never
>
> Is it possible that a zfs recv of a root pool contains the device names
> from the sending hardware?

Yes, the data installed by the zfs recv will contain the device names from
the sending hardware.

I looked at the instructions in the blog you reference above, and while the
procedure *might* work in some circumstances, it would mostly be by accident.
Maybe if there is an exact match of hardware it might work, but there's also
metadata that describes the BEs on a system, and I doubt whether the
send/recv would restore all the information necessary to do that.

You might want to bring this subject up on the caiman-disc...@opensolaris.org
alias, where needs like this can be addressed for real, in the supported
installation tools.

Lori

> [...]
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 20:52, Arne Jansen wrote:
> Ross Walker wrote:
>> Raidz is definitely made for sequential IO patterns, not random. To get
>> good random IO with raidz you need a zpool with X raidz vdevs, where
>> X = desired IOPS / IOPS of a single drive.
>
> I have seen statements like this repeated several times, though I haven't
> been able to find an in-depth discussion of why this is the case.
>
> [...]

http://blogs.sun.com/roch/entry/when_to_and_not_to

-- 
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] One dataset per user?
On Tue, 22 Jun 2010, Arne Jansen wrote:

> We found that the zfs utility is very inefficient as it does a lot of
> unnecessary and costly checks.

Hmm, presumably somebody at Sun doesn't agree with that assessment or you'd
think they'd take them out :).

Mounting/sharing by hand outside of the zfs framework does make a huge
difference. It takes about 45 minutes to mount/share or unshare/unmount with
the mountpoint and sharenfs zfs properties set; mounting/sharing by hand with
SHARE_NOINUSE_CHECK=1, even just sequentially, only took about 2 minutes.
With some parallelization I could definitely see hitting that 10 seconds you
mentioned, which would sure make my patch windows a hell of a lot shorter.
I'll need to put together a script and fiddle some with smf, joy oh joy -- I
need these filesystems mounted before the web server starts.

Thanks much for the tip! I'm hoping someday they'll clean up the sharing
implementation and make it a bit more scalable. I had a ticket open once and
they pretty much said it would never happen for Solaris 10, but maybe
sometime in the indefinite future for OpenSolaris...

-- 
Paul B. Henson |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768