Re: [zfs-discuss] Wired write performance problem
On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:

>> One day, the write performance of zfs degraded.
>> The write performance decreased from 60MB/s to about 6MB/s in sequential
>> write.
>>
>> Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=1;date
>
> See this thread:
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting the metaslab size and see if it fixes your performance problem.

And if pool usage is >90%, then there's another problem (the algorithm for
finding free space changes at that point).

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] Wired write performance problem
Hi, also see:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html

We hit this with Sol11 though, not sure if it's possible with Sol10.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ding Honghui
Sent: 8 June 2011 6:07
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Wired write performance problem

Hi,

I have a weird write performance problem and need your help.

One day, the write performance of zfs degraded: sequential write throughput
dropped from 60MB/s to about 6MB/s.

Command:
date;dd if=/dev/zero of=block bs=1024*128 count=1;date

The hardware configuration is 1 Dell MD3000 and 1 MD1000 with 30 disks.
The OS is Solaris 10U8, zpool version 15 and zfs version 4.

I ran DTrace to trace the write performance:

fbt:zfs:zfs_write:entry
{
        self->ts = timestamp;
}

fbt:zfs:zfs_write:return
/self->ts/
{
        @time = quantize(timestamp - self->ts);
        self->ts = 0;
}

It shows:

           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         16
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@                 3270
           65536 |@@@@@@@                                  898
          131072 |@@@@@@@                                  985
          262144 |                                         33
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         3
         4194304 |                                         0
         8388608 |@                                        180
        16777216 |                                         33
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         1
      1073741824 |                                         2
      2147483648 |                                         0
      4294967296 |                                         0
      8589934592 |                                         0
     17179869184 |                                         2
     34359738368 |                                         3
     68719476736 |                                         0

Compared with a storage system that works well (a single MD3000), where the
maximum zfs_write time is 4294967296 ns, that system is about 10 times faster.

Any suggestions?

Thanks
Ding
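For anyone wanting to reproduce the measurement, the two probes above can be
saved to a file and run standalone; a minimal sketch (the file name and the
60-second duration are arbitrary choices, not from the original post):

  # save the zfs_write probes above as zfs_write_time.d, then:
  dtrace -s zfs_write_time.d -n 'tick-60s { exit(0); }'
  # the @time quantize() aggregation is printed automatically when the script exits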
Re: [zfs-discuss] Wired write performance problem
On 06/08/2011 12:12 PM, Donald Stahl wrote:
>> One day, the write performance of zfs degraded.
>> The write performance decreased from 60MB/s to about 6MB/s in sequential
>> write.
>>
>> Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=1;date
>
> See this thread:
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>
> And search in the page for:
> "metaslab_min_alloc_size"
>
> Try adjusting the metaslab size and see if it fixes your performance problem.
> -Don

"metaslab_min_alloc_size" is not used when the block allocator is the dynamic
block allocator [1], so it is not a tunable parameter in my case.

Thanks anyway.

[1] http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#496
Re: [zfs-discuss] Wired write performance problem
For now, I find that a long time is spent in the function metaslab_block_picker
in metaslab.c. I guess there may be many AVL search operations. I am still not
sure what causes the AVL searching, or whether there are any parameters to
tune for it.

Any suggestions?

On 06/08/2011 05:57 PM, Markus Kovero wrote:
> Hi, also see:
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html
>
> We hit this with Sol11 though, not sure if it's possible with Sol10.
>
> Yours
> Markus Kovero
>
> [original problem report quoted in full; see earlier in this thread]
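To confirm where the time is going, the same timing approach used earlier for
zfs_write can be pointed at the picker itself; a minimal sketch, assuming fbt
can still instrument this function (it is static in metaslab.c, so it must not
have been inlined by the compiler):

  fbt:zfs:metaslab_block_picker:entry
  {
          self->ts = timestamp;
  }

  fbt:zfs:metaslab_block_picker:return
  /self->ts/
  {
          @picker = quantize(timestamp - self->ts);
          self->ts = 0;
  }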
Re: [zfs-discuss] Wired write performance problem
On 06/08/2011 04:05 PM, Tomas Ögren wrote:
> On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:
>>> One day, the write performance of zfs degraded.
>>> The write performance decreased from 60MB/s to about 6MB/s in sequential
>>> write.
>>>
>>> Command:
>>> date;dd if=/dev/zero of=block bs=1024*128 count=1;date
>>
>> See this thread:
>> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
>>
>> And search in the page for:
>> "metaslab_min_alloc_size"
>>
>> Try adjusting the metaslab size and see if it fixes your performance problem.
>
> And if pool usage is >90%, then there's another problem (the algorithm for
> finding free space changes at that point).
>
> /Tomas

Tomas,

Thanks for your suggestion. You are right. I tuned the parameter
metaslab_df_free_pct from 35 to 4 some days ago to reduce this problem.
The performance stayed good for about one week, then degraded again.

I am still not sure how many operations fall under the best-fit block
allocation policy and how many fall under the first-fit policy in the
current situation.

Any help is much appreciated.

Regards,
Ding
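For reference, a minimal sketch of reading and changing metaslab_df_free_pct
at runtime with mdb; the symbol name is taken from metaslab.c, but whether it
is present (and safe to write) depends on the kernel build, so verify the read
works before attempting the write:

  # read the current value as a 32-bit decimal
  echo "metaslab_df_free_pct/D" | mdb -k
  # set it to 4 (the 0t prefix marks a decimal value in mdb)
  echo "metaslab_df_free_pct/W 0t4" | mdb -kw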
[zfs-discuss] zpool import crashs SX11 trying to recovering a corrupted zpool
Hi,

I have the following problem: during a controller (LSI MegaRAID 9261-8i)
outage, a Solaris Express 11 zpool got corrupted. It is a whole 1.3 TB rpool
zpool, on a RAID5 volume built by the controller. After replacing the damaged
controller, the new one reports the volume as OPTIMAL. Important: the zpool
has dedup enabled.

If I try to:
- boot it normally
- roll back some ZILs with  (launched from the SXCD in rescue mode)
- start a re-install from the SXCD

I get a system crash and an instant reboot every time.

Then I tried to check/manipulate it from OpenIndiana v148 and v151b CDs in
rescue mode. Trying , it reported that the zpool is present, but that it has
a newer on-disk format version. Running zdb, it reports that the label and
other zpool information seem OK. But Solaris Express 11 has zpool version 31,
while OpenIndiana (even version 151 beta) only reaches version 28.

If standard SX11 crashes as soon as it sees this zpool, and OI can't handle
this newer zpool version, how can I try to fix it via  (or via whatever other
tools)? Is there a newer (patched) version of the SXCD, different from the
standard one downloadable from the Oracle web site? Is there any "independent"
Solaris distribution implementing zpool v31? Otherwise, do you have some other
workaround to fix it?

Thank you very much
Stefano
--
This message posted from opensolaris.org
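One avenue that may be worth trying from a rescue environment whose zpool
supports recovery mode is a dry run of the recovery import; a hedged sketch
(the pool name rpool is assumed from the description above, and -n only
reports what -F recovery would discard without actually modifying the pool):

  zpool import -f -F -n rpool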
Re: [zfs-discuss] SE11 express Encryption on -> errors in the pool after Scrub
OK, I tested it. I ran two scrubs with the encrypted folders open (keys
loaded). No issues anymore. Thanks for the hint. I hope this will be fixed for
everyone soon.

Cheers

On 06.06.2011, 11:54, Darren J Moffat wrote:
> On 06/04/11 13:52, Thomas Hobbes wrote:
>> I am testing Solaris Express 11 with napp-it on two machines. In both
>> cases the same problem: enabling encryption on a folder and filling it
>> with data results in errors indicated by a subsequent scrub. I did not
>> find the topic on the web, nor experiences shared by people using
>> encryption on SE11 Express. Advice would be highly appreciated.
>
> If you are doing the scrub when the encryption keys are not present, it is
> possible you are hitting a known (and very recently fixed in the Solaris 11
> development gates) bug.
>
> If you have an operating systems support contract with Oracle, you should
> be able to log a support ticket and request a backport of the fix for
> CR 6989185.

--
Created with Opera's revolutionary e-mail module: http://www.opera.com/mail/
Re: [zfs-discuss] L2ARC and poor read performance
>> Are some of the reads sequential? Sequential reads don't go to L2ARC.
>
> That'll be it. I assume the L2ARC is just taking metadata. In situations
> such as mine, I would quite like the option of routing sequential read
> data to the L2ARC also.

The good news is that it is almost a certainty that actual iSCSI usage will be
of a (more) random nature than your tests, suggesting higher L2ARC usage in a
real-world application.

I'm not sure how ZFS makes the distinction between a random and a sequential
read, but the more you think about it, not caching sequential requests makes
sense.
--
This message posted from opensolaris.org
Re: [zfs-discuss] L2ARC and poor read performance
On 08/06/2011 14:35, Marty Scholes wrote:
>>> Are some of the reads sequential? Sequential reads don't go to L2ARC.
>>
>> That'll be it. I assume the L2ARC is just taking metadata. In situations
>> such as mine, I would quite like the option of routing sequential read
>> data to the L2ARC also.
>
> The good news is that it is almost a certainty that actual iSCSI usage will
> be of a (more) random nature than your tests, suggesting higher L2ARC usage
> in a real-world application. I'm not sure how ZFS makes the distinction
> between a random and a sequential read, but the more you think about it,
> not caching sequential requests makes sense.

Yes, in most cases, but I can think of some counter-examples ;)
Re: [zfs-discuss] Wired write performance problem
On 06/08/2011 09:15 PM, Donald Stahl wrote:
>> "metaslab_min_alloc_size" is not used when the block allocator is the
>> dynamic block allocator [1], so it is not a tunable parameter in my case.
>
> May I ask where it says this is not a tunable in that case? I've read
> through the code and I don't see what you are talking about.
>
> The problem you are describing - including the "long time in function
> metaslab_block_picker" - exactly matches the block picker trying to find a
> large enough block and failing.
>
> What value do you get when you run:
> echo "metaslab_min_alloc_size/K" | mdb -kw
> ?
>
> You can always try setting it via:
> echo "metaslab_min_alloc_size/Z 1000" | mdb -kw
> and if that doesn't work, set it right back.
>
> I'm not familiar with the specifics of Solaris 10u8, so perhaps this is not
> a tunable in that version, but if it is, I would suggest you try changing
> it. If your performance is as bad as you say, it can't hurt to try.
> -Don

Thanks very much, Don.

In Solaris 10u8:

root@nas-hz-01:~# uname -a
SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
root@nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
root@nas-hz-01:~#

The pool version is 15 and the zfs version is 4.

The parameter is valid on my OpenIndiana build 148 box, whose zpool version is
28 and zfs version is 5:

ops@oi:~$ echo "metaslab_min_alloc_size/Z 1000" | pfexec mdb -kw
metaslab_min_alloc_size:        0x1000          =       0x1000
ops@oi:~$

I'm not sure which version introduced the parameter. Should I run this
OpenIndiana instead? Any suggestions?

Regards,
Ding
[zfs-discuss] RealSSD C300 -> Crucial CT064M4SSD2
Anyone running a Crucial CT064M4SSD2? Any good, or should I try getting a
RealSSD C300, as long as these are still available?

--
Eugen* Leitl  leitl  http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Re: [zfs-discuss] Wired write performance problem
> In Solaris 10u8:
> root@nas-hz-01:~# uname -a
> SunOS nas-hz-01 5.10 Generic_141445-09 i86pc i386 i86pc
> root@nas-hz-01:~# echo "metaslab_min_alloc_size/K" | mdb -kw
> mdb: failed to dereference symbol: unknown symbol name

Fair enough. I don't have anything older than b147 at this point, so I wasn't
sure if that was in there or not.

If you delete a bunch of data (perhaps old files you have lying around), does
your performance go back up, even if only temporarily?

The problem we had matches your description word for word. All of a sudden we
had terrible write performance, with a ton of time spent in the metaslab
allocator. Then we'd delete a big chunk of data (100 gigs or so) and poof -
performance would get better for a short while.

Several people suggested changing the allocation free percent from 30 to 4,
but that change was already incorporated into the b147 box we were testing.
The only thing that made a difference (and I mean a night and day difference)
was the change above.

That said, I have no idea how that part of the code works in 10u8.
-Don
Re: [zfs-discuss] Wired write performance problem
On 06/08/11 01:05, Tomas Ögren wrote:
> And if pool usage is >90%, then there's another problem (the algorithm for
> finding free space changes at that point).

Another (less satisfying) workaround is to increase the amount of free space
in the pool, either by reducing usage or adding more storage. The observed
behavior is that allocation is fast until usage crosses a threshold, then
performance hits a wall.

I have a small sample size (maybe 2-3 samples), but the threshold point varies
from pool to pool while tending to be consistent for a given pool. I suspect
some artifact of layout/fragmentation is at play. I've seen things hit the
wall at as low as 70% on one pool.

The original poster's pool is about 78% full. If possible, try freeing stuff
until usage goes back under 75% or 70% and see if your performance returns.
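A quick way to see how close a pool is to that wall is the capacity column of
zpool list; a trivial sketch (the pool name tank is an assumption):

  zpool list tank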
Re: [zfs-discuss] L2ARC and poor read performance
On Jun 7, 2011, at 9:12 AM, Phil Harman wrote:

> Ok here's the thing ...
>
> A customer has some big tier 1 storage, and has presented 24 LUNs (from
> four RAID6 groups) to an OI148 box which is acting as a kind of iSCSI/FC
> bridge (using some of the cool features of ZFS along the way). The OI box
> currently has 32GB configured for the ARC, and 4x 223GB SSDs for L2ARC. It
> has a dual port QLogic HBA, and is currently configured to do round-robin
> MPXIO over two 4Gbps links. The iSCSI traffic is over a dual 10Gbps card
> (rather like the one Sun used to sell).

The ARC size is not big enough to hold the data for the L2ARC headers for an
L2ARC of that size.

> I've just built a fresh pool, and have created 20x 100GB zvols which are
> mapped to iSCSI clients. I have initialised the first 20GB of each zvol
> with random data. I've had a lot of success with write performance (e.g.
> in earlier tests I had 20 parallel streams writing 100GB each at over
> 600MB/sec aggregate), but read performance is very poor.
>
> Right now I'm just playing with 20 parallel streams of reads from the
> first 2GB of each zvol (i.e. 40GB in all). During each run, I see lots of
> writes to the L2ARC, but less than a quarter the volume of reads. Yet my
> FC LUNs are hot with 1000s of reads per second. This doesn't change from
> run to run. Why?

Writes to the L2ARC devices are throttled to 8 or 16 MB/sec. If the L2ARC fill
cannot keep up, the data is unceremoniously evicted.

> Surely 20x 2GB of data (and its associated metadata) will sit nicely in
> 4x 223GB SSDs?

On Jun 7, 2011, at 12:34 PM, Marty Scholes wrote:

> I'll throw out some (possibly bad) ideas.
>
> Is ARC satisfying the caching needs? 32 GB for ARC should almost cover the
> 40GB of total reads, suggesting that the L2ARC doesn't add any value for
> this test.
>
> Are the SSD devices saturated from an I/O standpoint? Put another way, can
> ZFS put data to them fast enough? If they aren't taking writes fast enough,
> then maybe they can't effectively load for caching. Certainly if they are
> saturated for writes they can't do much for reads.
>
> Are some of the reads sequential? Sequential reads don't go to L2ARC.

This is not a true statement. If the primarycache policy is set to the
default, all data will be cached in the ARC.

> What does iostat say for the SSD units? What does arc_summary.pl (maybe
> spelled differently) say about the ARC / L2ARC usage? How much of the SSD
> units are in use as reported in zpool iostat -v?

The ARC statistics are nicely documented in arc.c and available as kstats.
 -- richard
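For reference, those kstats can be read directly without any helper scripts;
a minimal sketch (the exact set of l2_* statistic names varies a little
between builds):

  # dump the L2ARC-related counters from the ZFS arcstats kstat
  kstat -p zfs:0:arcstats | egrep 'l2_(size|hits|misses|write_bytes)'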
Re: [zfs-discuss] Wired write performance problem
> Another (less satisfying) workaround is to increase the amount of free
> space in the pool, either by reducing usage or adding more storage. The
> observed behavior is that allocation is fast until usage crosses a
> threshold, then performance hits a wall.

We actually tried this solution. We were at 70% usage and performance hit a
wall. We figured it was because of the change of fit algorithm, so we added 16
2TB disks in mirrors (added 16TB to an 18TB pool). It made almost no
difference in our pool performance. It wasn't until we told the metaslab
allocator to stop looking for such large chunks that the problem went away.

> The original poster's pool is about 78% full. If possible, try freeing
> stuff until usage goes back under 75% or 70% and see if your performance
> returns.

Freeing stuff did fix the problem for us (temporarily), but only in an
indirect way. When we freed up a bunch of space, the metaslab allocator was
able to find large enough blocks to write to without searching all over the
place. This would fix the performance problem until those large free blocks
got used up. Then - even though we were below the usage problem threshold from
earlier - we would still have the performance problem.
-Don
Re: [zfs-discuss] L2ARC and poor read performance
> This is not a true statement. If the primarycache policy is set to the
> default, all data will be cached in the ARC.

Richard, you know this stuff so well that I am hesitant to disagree with you.
At the same time, I have seen this myself, trying to load video files into
L2ARC without success.

> The ARC statistics are nicely documented in arc.c and available as kstats.

And I looked in the source. My C is a little rusty, yet it appears that
prefetched items are not stored in L2ARC by default. Prefetches will satisfy a
good portion of sequential reads but won't go to L2ARC.
--
This message posted from opensolaris.org
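The behaviour Marty describes is consistent with the l2arc_noprefetch variable
in arc.c, a boolean that (when set, as it is by default) keeps prefetched
buffers out of the L2ARC. A hedged mdb sketch for inspecting and flipping it,
assuming the symbol exists in the running build:

  echo "l2arc_noprefetch/D" | mdb -k       # 1 = prefetched buffers excluded (default)
  echo "l2arc_noprefetch/W 0" | mdb -kw    # allow prefetched buffers into the L2ARC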
Re: [zfs-discuss] RealSSD C300 -> Crucial CT064M4SSD2
On 08 June, 2011 - Eugen Leitl sent me these 0,5K bytes:

> Anyone running a Crucial CT064M4SSD2? Any good, or should
> I try getting a RealSSD C300, as long as these are still
> available?

Haven't tried any of those, but how about one of these:

OCZ Vertex 3 (SandForce SF-2281, SATA III, MLC, to be used for L2ARC):

shazoo:~# gdd if=/dev/rdsk/c0t5E83A97F98CEFE5Dd0s0 of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.21005 s, 486 MB/s

OCZ Vertex 2 EX (SandForce SF-1500, SATA II, SLC with supercap, to be used for ZIL):

shazoo:~# gdd if=/dev/rdsk/c0t5E83A97F1471E0A4d0s0 of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.93114 s, 273 MB/s

This is in an X4170 M2 with Solaris 10.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] RealSSD C300 -> Crucial CT064M4SSD2
I am running 4 of the 128GB version in our DR environment as L2ARC. I don't
have anything bad to say about them. They run quite well.

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tomas Ögren
Sent: Wednesday, June 08, 2011 12:30 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] RealSSD C300 -> Crucial CT064M4SSD2

[Tomas's benchmark figures quoted in full; see the previous message in this thread]
Re: [zfs-discuss] L2ARC and poor read performance
On Wed, Jun 08, 2011 at 11:44:16AM -0700, Marty Scholes wrote:
> And I looked in the source. My C is a little rusty, yet it appears
> that prefetched items are not stored in L2ARC by default. Prefetches
> will satisfy a good portion of sequential reads but won't go to
> L2ARC.

They won't go to L2ARC while they're still speculative reads, maybe. Once
they're actually used by the app to satisfy a good portion of the actual
reads, they'll have hit stats and will.

I suspect the problem is the threshold for L2ARC writes. Sequential reads can
be much faster than this rate, meaning it can take a lot of effort/time to
fill. You could test by doing slow sequential reads, and see if the L2ARC
fills any more for the same reads spread over a longer time.

-- 
Dan.
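For reference, the throttle Dan mentions corresponds to the l2arc_write_max
and l2arc_write_boost variables in arc.c (8MB per fill interval each by
default); a hedged mdb sketch for inspecting and raising the cap, assuming the
symbols exist in the running build (they are 64-bit values, hence /E and /Z):

  echo "l2arc_write_max/E" | mdb -k                 # current cap, bytes per fill interval
  echo "l2arc_write_max/Z 0t67108864" | mdb -kw     # e.g. raise it to 64MB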
Re: [zfs-discuss] Wired write performance problem
On 06/09/2011 12:23 AM, Donald Stahl wrote:
> We actually tried this solution. ... It wasn't until we told the metaslab
> allocator to stop looking for such large chunks that the problem went away.
> [rest of the message quoted in full; see earlier in this thread]

Don,

From your words, my symptom is almost the same as yours.

We have examined the metaslab layout. When metaslab_df_free_pct was 35, there
were 65 free metaslabs (64G each). The write performance was very low, and a
rough test showed that no new free metaslab would be loaded and activated.

Then we tuned metaslab_df_free_pct to 4. The performance stayed good for about
one week, and the number of free metaslabs dropped to 51. But now the write
bandwidth is poor again (maybe I'd better trace the free space of each
metaslab?).

Maybe there is a problem in the metaslab rating score (weight) used to select
a metaslab, or in the block allocator algorithm?

Here is a snapshot of the metaslab layout; the last 51 metaslabs have 64G of
free space each:

vdev       offset        spacemap      free
----       ------        --------      ----
... snip
vdev 3     offset 270    spacemap 440  free 21.0G
vdev 3     offset 280    spacemap 31   free 7.36G
vdev 3     offset 290    spacemap 32   free 2.44G
vdev 3     offset 2a0    spacemap 33   free 2.91G
vdev 3     offset 2b0    spacemap 34   free 3.25G
vdev 3     offset 2c0    spacemap 35   free 3.03G
vdev 3     offset 2d0    spacemap 36   free 3.20G
vdev 3     offset 2e0    spacemap 90   free 3.28G
vdev 3     offset 2f0    spacemap 91   free 2.46G
vdev 3     offset 300    spacemap 92   free 2.98G
vdev 3     offset 310    spacemap 93   free 2.19G
vdev 3     offset 320    spacemap 94   free 2.42G
vdev 3     offset 330    spacemap 95   free 2.83G
vdev 3     offset 340    spacemap 252  free 41.6G
vdev 3     offset 350    spacemap 0    free 64G
vdev 3     offset 360    spacemap 0    free 64G
vdev 3     offset 370    spacemap 0    free 64G
vdev 3     offset 380    spacemap 0    free 64G
vdev 3     offset 390    spacemap 0    free 64G
vdev 3     offset 3a0    spacemap 0    free 64G
vdev 3     offset 3b0    spacemap 0    free 64G
vdev 3     offset 3c0    spacemap 0    free 64G
vdev 3     offset 3d0    spacemap 0    free 64G
vdev 3     offset 3e0    spacemap 0    free 64G
... snip
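For anyone wanting to reproduce this kind of layout dump on their own pool, a
hedged sketch (the pool name tank is an assumption, and the exact output
format of the metaslab report differs between zpool versions):

  zdb -m tank      # per-vdev metaslab offsets, space maps and free space
  zdb -mm tank     # same, plus the individual space map segments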
Re: [zfs-discuss] Wired write performance problem
On 06/09/2011 10:14 AM, Ding Honghui wrote:
> [previous message, including the metaslab layout, quoted in full]

I freed up some disk space (about 300GB) and the performance is back again.
I'm sure the performance will degrade again soon.
Re: [zfs-discuss] Wired write performance problem
> Here is a snapshot of the metaslab layout; the last 51 metaslabs have 64G
> of free space each.

After we added all the disks to our system we had lots of free metaslabs - but
that didn't seem to matter. I don't know if perhaps the system was attempting
to balance the writes across more of our devices, but whatever the reason, the
percentage didn't seem to matter. All that mattered was changing the size of
the min_alloc tunable.

You seem to have gotten a lot deeper into some of this analysis than I did, so
I'm not sure I can really add anything. Since 10u8 doesn't support that
tunable, I'm not really sure where to go from there. If you can take the pool
offline, you might try connecting it to a b148 box and see if that tunable
makes a difference.

Beyond that I don't really have any suggestions. Your problem description,
including the return of performance when freeing space, is _identical_ to the
problem we had. After checking every single piece of hardware, replacing
countless pieces, and removing COMSTAR and other pieces from the puzzle, the
only change that helped was changing that tunable.

I wish I could be of more help, but I have not had the time to dive into the
ZFS code with any gusto.
-Don