[zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Hi All,

I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134.

Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled. There is currently one failed disk that is being replaced shortly.

Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134? I attempted to remove Sol 11 and reinstall 134 but it keeps freezing during install, which is probably another issue entirely...

iostat output is below. When running iostat -v 2, write ops and throughput hold steady at the level shown.

                 capacity     operations     bandwidth
pool            alloc   free  read  write   read  write
--------------  -----  -----  ----  -----  -----  -----
MirrorPool      12.2T  4.11T   153  4.63K  6.06M  33.6M
  mirror        1.04T   325G    11    416   400K  2.80M
    c7t0d0          -      -     5    114   163K  2.80M
    c7t1d0          -      -     6    114   237K  2.80M
  mirror        1.04T   324G    10    374   426K  2.79M
    c7t2d0          -      -     5    108   190K  2.79M
    c7t3d0          -      -     5    107   236K  2.79M
  mirror        1.04T   324G    15    425   537K  3.15M
    c7t4d0          -      -     7    115   290K  3.15M
    c7t5d0          -      -     8    116   247K  3.15M
  mirror        1.04T   325G    13    412   572K  3.00M
    c7t6d0          -      -     7    115   313K  3.00M
    c7t7d0          -      -     6    116   259K  3.00M
  mirror        1.04T   324G    13    381   580K  2.85M
    c7t8d0          -      -     7    111   362K  2.85M
    c7t9d0          -      -     5    111   219K  2.85M
  mirror        1.04T   325G    15    408   654K  3.10M
    c7t10d0         -      -     7    122   336K  3.10M
    c7t11d0         -      -     7    123   318K  3.10M
  mirror        1.04T   325G    14    461   681K  3.22M
    c7t12d0         -      -     8    130   403K  3.22M
    c7t13d0         -      -     6    132   278K  3.22M
  mirror         749G   643G     1    279   140K  1.07M
    c4t14d0         -      -     0      0      0      0
    c7t15d0         -      -     1     83   140K  1.07M
  mirror        1.05T   319G    18    333   672K  2.74M
    c7t16d0         -      -    11     96   406K  2.74M
    c7t17d0         -      -     7     96   266K  2.74M
  mirror        1.04T   323G    13    353   540K  2.85M
    c7t18d0         -      -     7     98   279K  2.85M
    c7t19d0         -      -     6    100   261K  2.85M
  mirror        1.04T   324G    12    459   543K  2.99M
    c7t20d0         -      -     7    118   285K  2.99M
    c7t21d0         -      -     4    119   258K  2.99M
  mirror        1.04T   324G    11    431   465K  3.04M
    c7t22d0         -      -     5    116   195K  3.04M
    c7t23d0         -      -     6    117   272K  3.04M
  c8t2d0            0  29.5G     0      0      0      0
cache               -      -     -      -      -      -
  c8t3d0        59.4G  3.88M   113     64  6.51M  7.31M
  c8t1d0        59.5G    48K    95     69  5.69M  8.08M

Thanks
-Matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
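The batch-write behaviour described above is governed by the txg sync timer. One way to inspect and experiment with it (a sketch only; it assumes zfs_txg_timeout is still the relevant kernel tunable on Solaris 11 Express, and the value written is purely illustrative):

# Show the current txg sync interval, in seconds
echo zfs_txg_timeout/D | mdb -k

# Temporarily set a 10-second batch interval (0t = decimal; not persistent across reboot)
echo zfs_txg_timeout/W0t10 | mdb -kw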
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Matthew Anderson wrote:
> Hi All,
>
> I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134.
>
> Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled.

How/Why?

> There is currently one failed disk that is being replaced shortly.
>
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?

What does "zfs get sync" report?

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
NAME                   PROPERTY  VALUE     SOURCE
MirrorPool             sync      disabled  local
MirrorPool/CCIT        sync      disabled  local
MirrorPool/EX01        sync      disabled  inherited from MirrorPool
MirrorPool/EX02        sync      disabled  inherited from MirrorPool
MirrorPool/FileStore1  sync      disabled  inherited from MirrorPool

Sync was disabled on the main pool and then left to inherit to everything else. The reason for disabling this in the first place was to fix bad NFS write performance (even with the ZIL on an X25-E SSD it was under 1MB/s). I've also tried setting logbias to throughput and latency but they both perform at around the same level.

Thanks
-Matt

-----Original Message-----
From: Andrew Gabriel [mailto:andrew.gabr...@oracle.com]
Sent: Wednesday, 27 April 2011 3:41 PM
To: Matthew Anderson
Cc: 'zfs-discuss@opensolaris.org'
Subject: Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

Matthew Anderson wrote:
> Hi All,
>
> I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134.
>
> Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled.

How/Why?

> There is currently one failed disk that is being replaced shortly.
>
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?

What does "zfs get sync" report?

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
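Since sync is inherited pool-wide here, a quick way to test whether the new constant-write behaviour is tied to the sync setting is to flip it back on a single dataset and watch the write pattern; a sketch using the dataset names above:

# Re-enable standard sync semantics on one test dataset only
zfs set sync=standard MirrorPool/FileStore1

# Verify what every dataset is actually using
zfs get -r sync MirrorPool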
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
> Sync was disabled on the main pool and then left to inherit to everything else. The reason for disabling this in the first place was to fix bad NFS write performance (even with the ZIL on an X25-E SSD it was under 1MB/s).
> I've also tried setting logbias to throughput and latency but they both perform at around the same level.
> Thanks
> -Matt

I believe you're hitting bug "7000208: Space map trashing affects NFS write throughput". We also did, and it did impact iscsi as well.

If you have enough ram you can try enabling metaslab debug (which makes the problem vanish):

# echo metaslab_debug/W1 | mdb -kw

And calculating the amount of ram needed:

/usr/sbin/amd64/zdb -mm > /tmp/zdb-mm.out
awk '/segments/ {s+=$2} END {printf("sum=%d\n",s)}' /tmp/zdb-mm.out

93373117 sum of segments
16 VDEVs * 116 metaslabs = 1856 metaslabs in total
93373117 / 1856 = 50308 average number of segments per metaslab
50308 * 1856 * 64 = 5975785472
5975785472 / 1024 / 1024 / 1024 = 5.56 GB

Yours
Markus Kovero
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
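The two zdb/awk steps can also be rolled into one pass that prints the RAM estimate directly; this is just a sketch of the same arithmetic (64 bytes per space-map segment) and assumes the pool name from earlier in the thread:

/usr/sbin/amd64/zdb -mm MirrorPool > /tmp/zdb-mm.out

# Sum the per-metaslab segment counts and convert to an in-core estimate at 64 bytes/segment
awk '/segments/ {s += $2} END {printf("segments=%d  est. RAM=%.2f GB\n", s, s*64/1024/1024/1024)}' /tmp/zdb-mm.out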
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
On 27 April, 2011 - Matthew Anderson sent me these 3,2K bytes:

> Hi All,
>
> I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134.
>
> Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled. There is currently one failed disk that is being replaced shortly.
>
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?
> I attempted to remove Sol 11 and reinstall 134 but it keeps freezing during install, which is probably another issue entirely...
>
> iostat output is below. When running iostat -v 2, write ops and throughput hold steady at the level shown.
>
>                  capacity     operations     bandwidth
> pool            alloc   free  read  write   read  write
> --------------  -----  -----  ----  -----  -----  -----
> MirrorPool      12.2T  4.11T   153  4.63K  6.06M  33.6M
>   mirror        1.04T   325G    11    416   400K  2.80M
>     c7t0d0          -      -     5    114   163K  2.80M
>     c7t1d0          -      -     6    114   237K  2.80M
>   mirror        1.04T   324G    10    374   426K  2.79M
>     c7t2d0          -      -     5    108   190K  2.79M
>     c7t3d0          -      -     5    107   236K  2.79M
>   mirror        1.04T   324G    15    425   537K  3.15M
>     c7t4d0          -      -     7    115   290K  3.15M
>     c7t5d0          -      -     8    116   247K  3.15M
>   mirror        1.04T   325G    13    412   572K  3.00M
>     c7t6d0          -      -     7    115   313K  3.00M
>     c7t7d0          -      -     6    116   259K  3.00M
>   mirror        1.04T   324G    13    381   580K  2.85M
>     c7t8d0          -      -     7    111   362K  2.85M
>     c7t9d0          -      -     5    111   219K  2.85M
>   mirror        1.04T   325G    15    408   654K  3.10M
>     c7t10d0         -      -     7    122   336K  3.10M
>     c7t11d0         -      -     7    123   318K  3.10M
>   mirror        1.04T   325G    14    461   681K  3.22M
>     c7t12d0         -      -     8    130   403K  3.22M
>     c7t13d0         -      -     6    132   278K  3.22M
>   mirror         749G   643G     1    279   140K  1.07M
>     c4t14d0         -      -     0      0      0      0
>     c7t15d0         -      -     1     83   140K  1.07M
>   mirror        1.05T   319G    18    333   672K  2.74M
>     c7t16d0         -      -    11     96   406K  2.74M
>     c7t17d0         -      -     7     96   266K  2.74M
>   mirror        1.04T   323G    13    353   540K  2.85M
>     c7t18d0         -      -     7     98   279K  2.85M
>     c7t19d0         -      -     6    100   261K  2.85M
>   mirror        1.04T   324G    12    459   543K  2.99M
>     c7t20d0         -      -     7    118   285K  2.99M
>     c7t21d0         -      -     4    119   258K  2.99M
>   mirror        1.04T   324G    11    431   465K  3.04M
>     c7t22d0         -      -     5    116   195K  3.04M
>     c7t23d0         -      -     6    117   272K  3.04M
>   c8t2d0            0  29.5G     0      0      0      0

Btw, this disk seems alone, unmirrored and a bit small..?

> cache               -      -     -      -      -      -
>   c8t3d0        59.4G  3.88M   113     64  6.51M  7.31M
>   c8t1d0        59.5G    48K    95     69  5.69M  8.08M
>
> Thanks
> -Matt
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lamp Zy
>
> One of my drives failed in Raidz2 with two hot spares:

What zpool & zfs version are you using? What OS version? Are all the drives precisely the same size (same make/model number?) and all the same firmware level?

Up to some point (I don't know which zpool version) there was a characteristic (some would say a bug) whereby even a drive one byte smaller would be an unsuitable replacement for a failed drive. And it was certainly known to happen sometimes that a single manufacturer & model of drive would occasionally have these tiny variations in supposedly identical drives. But they created a workaround for this in some version of zpool.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On 04/26/2011 01:25 AM, Nikola M. wrote:
> On 04/26/11 01:56 AM, Lamp Zy wrote:
>> Hi,
>> One of my drives failed in Raidz2 with two hot spares:
>
> What are zpool/zfs versions? (zpool upgrade Ctrl+C, zfs upgrade Ctrl+C).
> The latest zpool/zfs versions available by numerical designation in all OpenSolaris-based distributions are zpool 28 and zfs v5. (That is why one should not update the S11Ex zpool/zfs version if wanting to keep using other open OpenSolaris-based distributions in multiple ZFS BEs.)
>
> What OS are you using with ZFS? Do you use Solaris 10/update release, Solaris 11 Express, OpenIndiana oi_148 dev/148b with IllumOS, OpenSolaris 2009.06/snv_134b, Nexenta, Nexenta Community, Schillix, FreeBSD, Linux zfs-fuse.. (I guess you're not using Linux with the ZFS kernel module, but just to mention it's available.. and OSX too).

Thank you for all the replies. Here is what we are using.

- Hardware:
  Server: SUN SunFire X4240
  DAS Storage: SUN Storage J4400 with 24x1TB SATA drives. Original drives. I assume they are identical.

- Software:
  OS: Solaris 10 5/09 s10x_u7wos_08 X86; stock install. No upgrades, no patches.
  ZFS pool version 10
  ZFS filesystem version 3

Another confusing thing is that I wasn't able to take the failed drive offline because there weren't enough replicas (?). First, the drive had already failed, and second, it's raidz2, which is the equivalent of RAID-6 and should be able to handle 2 failed drives. I skipped that step but wanted to mention it here.

I used "zpool replace" and resilvering finished successfully. Then "zpool detach" removed the drive and now I have this:

# zpool status fwgpool0
  pool: fwgpool0
 state: ONLINE
 scrub: resilver completed after 12h59m with 0 errors on Wed Apr 27 05:15:17 2011
config:

        NAME                       STATE     READ WRITE CKSUM
        fwgpool0                   ONLINE       0     0     0
          raidz2                   ONLINE       0     0     0
            c4t5000C500108B406Ad0  ONLINE       0     0     0
            c4t5000C50010F436E2d0  ONLINE       0     0     0
            c4t5000C50011215B6Ed0  ONLINE       0     0     0
            c4t5000C50011234715d0  ONLINE       0     0     0
            c4t5000C50011252B4Ad0  ONLINE       0     0     0
            c4t5000C500112749EDd0  ONLINE       0     0     0
            c4t5000C50014D70072d0  ONLINE       0     0     0
            c4t5000C500112C4959d0  ONLINE       0     0     0
            c4t5000C50011318199d0  ONLINE       0     0     0
            c4t5000C500113C0E9Dd0  ONLINE       0     0     0
            c4t5000C500113D0229d0  ONLINE       0     0     0
            c4t5000C500113E97B8d0  ONLINE       0     0     0
            c4t5000C50014D065A9d0  ONLINE       0     0     0
            c4t5000C50014D0B3B9d0  ONLINE       0     0     0
            c4t5000C50014D55DEFd0  ONLINE       0     0     0
            c4t5000C50014D642B7d0  ONLINE       0     0     0
            c4t5000C50014D64521d0  ONLINE       0     0     0
            c4t5000C50014D69C14d0  ONLINE       0     0     0
            c4t5000C50014D6B2CFd0  ONLINE       0     0     0
            c4t5000C50014D6C6D7d0  ONLINE       0     0     0
            c4t5000C50014D6D486d0  ONLINE       0     0     0
            c4t5000C50014D6D77Fd0  ONLINE       0     0     0
        spares
          c4t5000C50014D7058Dd0    AVAIL

errors: No known data errors
#

Great. So, now how do I identify which drive out of the 24 in the storage unit is the one that failed?

I looked on the Internet for help but the problem is that this drive completely disappeared. Even "format" and "iostat -En" show only 23 drives when there are physically 24.

Any ideas how to identify which drive is the one that failed so I can replace it?

Thanks
Peter
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Wed, Apr 27, 2011 at 12:51 PM, Lamp Zy wrote:
> Any ideas how to identify which drive is the one that failed so I can
> replace it?

Try the following:

# fmdump -eV
# fmadm faulty -B

--
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
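If the fmdump output is long, the lines that actually identify the affected device can be filtered out; this is only a sketch and assumes the ZFS ereports carry the usual vdev_path/devid members:

fmdump -eV | egrep 'vdev_path|devid' | sort | uniq -c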
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Wed, Apr 27, 2011 at 3:51 PM, Lamp Zy wrote: > Great. So, now how do I identify which drive out of the 24 in the storage > unit is the one that failed? > > I looked on the Internet for help but the problem is that this drive > completely disappeared. Even "format" and "iostat -En" show only 23 drives > when there are physically 24. > > Any ideas how to identify which drive is the one that failed so I can > replace it? We are using CAM to monitor our J4400s and through that interface you can see which drive is in which slot. http://www.oracle.com/us/products/servers-storage/storage/storage-software/031603.htm -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble
>
> (BTW, is there any way to get a measurement of number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) *snip*
>
> you need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the particular CLI usage for that, as I don't know what it is off the top of my head.

Anybody know the answer to that one?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 27 April, 2011 - Edward Ned Harvey sent me these 0,6K bytes:

> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble
> >
> > (BTW, is there any way to get a measurement of number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) *snip*
> >
> > you need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the particular CLI usage for that, as I don't know what it is off the top of my head.
>
> Anybody know the answer to that one?

zdb -bb pool

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
On 4/27/11 4:00 AM, Markus Kovero wrote:
>> Sync was disabled on the main pool and then left to inherit to everything else. The reason for disabling this in the first place was to fix bad NFS write performance (even with the ZIL on an X25-E SSD it was under 1MB/s).
>> I've also tried setting logbias to throughput and latency but they both perform at around the same level.
>> Thanks
>> -Matt
>
> I believe you're hitting bug "7000208: Space map trashing affects NFS write throughput". We also did, and it did impact iscsi as well.
>
> If you have enough ram you can try enabling metaslab debug (which makes the problem vanish):
>
> # echo metaslab_debug/W1 | mdb -kw
>
> And calculating the amount of ram needed:
>
> /usr/sbin/amd64/zdb -mm > /tmp/zdb-mm.out

metaslab 65 offset 410 spacemap 258 free
Assertion failed: space_map_load(sm, zfs_metaslab_ops, SM_FREE, smo, spa->spa_meta_objset) == 0, file ../zdb.c, line 571, function dump_metaslab

Is this something I should worry about?

uname -a
SunOS E55000 5.11 oi_148 i86pc i386 i86pc Solaris

> awk '/segments/ {s+=$2} END {printf("sum=%d\n",s)}' /tmp/zdb-mm.out
>
> 93373117 sum of segments
> 16 VDEVs * 116 metaslabs = 1856 metaslabs in total
> 93373117 / 1856 = 50308 average number of segments per metaslab
> 50308 * 1856 * 64 = 5975785472
> 5975785472 / 1024 / 1024 / 1024 = 5.56 GB
>
> Yours
> Markus Kovero
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin
>
> No, that's not true. The DDT is just like any other ZFS metadata and can be split over the ARC, cache device (L2ARC) and the main pool devices. An infrequently referenced DDT block will get evicted from the ARC to the L2ARC then evicted from the L2ARC.

When somebody has their "baseline" system, and they're thinking about adding dedup and/or cache, I'd like to understand the effect of not having enough ram. Obviously the impact will be performance, but precisely...

At bootup, I presume the arc & l2arc are all empty. So all the DDT entries reside in pool. As the system reads things (anything, files etc) from pool, it will populate arc, and follow fill rate policies to populate the l2arc over time. Every entry in l2arc requires 200 bytes of arc, regardless of what type of entry it is. (A DDT entry in l2arc consumes just as much arc memory as any other type of l2arc entry.) (Ummm... What's the point of that? Aren't DDT entries 270 bytes and ARC references 200 bytes? Seems like a very questionable benefit to allow DDT entries to get evicted into L2ARC.) So the ram consumption caused by the presence of l2arc will initially be zero after bootup, and it will grow over time as the l2arc populates, up to a maximum which is determined linearly as 200 bytes * the number of entries that can fit in the l2arc. Of course that number varies based on the size of each entry and size of l2arc, but at least you can estimate and establish upper and lower bounds.

So that's how the l2arc consumes system memory in arc. The penalty of insufficient ram, in conjunction with enabled L2ARC, is insufficient arc availability for other purposes - Maybe the whole arc is consumed by l2arc entries, and so the arc doesn't have any room for other stuff like commonly used files. Worse yet, your arc consumption could be so large, that PROCESSES don't fit in ram anymore. In this case, your processes get pushed out to swap space, which is really bad.

Correct me if I'm wrong, but the dedup sha256 checksum happens in addition to (not instead of) the fletcher2 integrity checksum. So after bootup, while the system is reading a bunch of data from the pool, all those reads are not populating the arc/l2arc with DDT entries. Reads are just populating the arc and l2arc with other stuff. DDT entries don't get into the arc/l2arc until something tries to do a write.

When performing a write, dedup calculates the checksum of the block to be written, and then it needs to figure out if that's a duplicate of another block that's already on disk somewhere. So (I guess this part) there's probably a tree-structure (I'll use the subdirectories and files analogy even though I'm certain that's not technically correct) on disk. You need to find the DDT entry, if it exists, for the block whose checksum is 1234ABCD. So you start by looking under the 1 directory, and from there look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you encounter "not found" at any step, then the DDT entry doesn't already exist and you decide to create a new one. But if you get all the way down to the C subdirectory and it contains a file named "D," then you have found a possible dedup hit - the checksum matched another block that's already on disk. Now the DDT entry is stored in ARC just like anything else you read from disk.
So the point is - Whenever you do a write, and the calculated DDT is not already in ARC/L2ARC, the system will actually perform several small reads looking for the DDT entry before it finally knows that the DDT entry actually exists. So the penalty of performing a write, with dedup enabled, and the relevant DDT entry not already in ARC/L2ARC is a very large penalty. What originated as a single write quickly became several small reads plus a write, due to the fact the necessary DDT entry was not already available. The penalty of insufficient ram, in conjunction with dedup, is terrible write performance. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
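Those extra DDT-lookup reads are visible on a live system by watching per-vdev I/O while writing to a dedup-enabled dataset; reads appearing under an otherwise write-only workload are the tell. (Sketch only; "tank" is a placeholder pool name.)

# Watch per-vdev reads/writes once per second during a dedup write test
zpool iostat -v tank 1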
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Apr 27, 2011, at 9:26 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin
>>
>> No, that's not true. The DDT is just like any other ZFS metadata and can be split over the ARC, cache device (L2ARC) and the main pool devices. An infrequently referenced DDT block will get evicted from the ARC to the L2ARC then evicted from the L2ARC.
>
> When somebody has their "baseline" system, and they're thinking about adding dedup and/or cache, I'd like to understand the effect of not having enough ram. Obviously the impact will be performance, but precisely...

Precision is only possible if you know what the data looks like...

> At bootup, I presume the arc & l2arc are all empty. So all the DDT entries reside in pool. As the system reads things (anything, files etc) from pool, it will populate arc, and follow fill rate policies to populate the l2arc over time. Every entry in l2arc requires 200 bytes of arc, regardless of what type of entry it is. (A DDT entry in l2arc consumes just as much arc memory as any other type of l2arc entry.) (Ummm... What's the point of that? Aren't DDT entries 270 bytes and ARC references 200 bytes?

No. The DDT entries vary in size.

> Seems like a very questionable benefit to allow DDT entries to get evicted into L2ARC.) So the ram consumption caused by the presence of l2arc will initially be zero after bootup, and it will grow over time as the l2arc populates, up to a maximum which is determined linearly as 200 bytes * the number of entries that can fit in the l2arc. Of course that number varies based on the size of each entry and size of l2arc, but at least you can estimate and establish upper and lower bounds.

The upper and lower bounds vary by 256x, unless you know what the data looks like more precisely.

> So that's how the l2arc consumes system memory in arc. The penalty of insufficient ram, in conjunction with enabled L2ARC, is insufficient arc availability for other purposes - Maybe the whole arc is consumed by l2arc entries, and so the arc doesn't have any room for other stuff like commonly used files.

I've never seen this.

> Worse yet, your arc consumption could be so large, that PROCESSES don't fit in ram anymore. In this case, your processes get pushed out to swap space, which is really bad.

[for Solaris, illumos, and NexentaOS] This will not happen unless the ARC size is at arc_min. At that point you are already close to severe memory shortfall.

> Correct me if I'm wrong, but the dedup sha256 checksum happens in addition to (not instead of) the fletcher2 integrity checksum.

You are mistaken.

> So after bootup, while the system is reading a bunch of data from the pool, all those reads are not populating the arc/l2arc with DDT entries. Reads are just populating the arc and l2arc with other stuff.

L2ARC is populated by a separate thread that watches the to-be-evicted list. The L2ARC fill rate is also throttled, so that under severe shortfall, blocks will be evicted without being placed in the L2ARC.

> DDT entries don't get into the arc/l2arc until something tries to do a write.

No, the DDT entry contains the references to the actual data.

> When performing a write, dedup calculates the checksum of the block to be written, and then it needs to figure out if that's a duplicate of another block that's already on disk somewhere.
> So (I guess this part) there's probably a tree-structure (I'll use the subdirectories and files analogy even though I'm certain that's not technically correct) on disk.

Implemented as an AVL tree.

> You need to find the DDT entry, if it exists, for the block whose checksum is 1234ABCD. So you start by looking under the 1 directory, and from there look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you encounter "not found" at any step, then the DDT entry doesn't already exist and you decide to create a new one. But if you get all the way down to the C subdirectory and it contains a file named "D," then you have found a possible dedup hit - the checksum matched another block that's already on disk. Now the DDT entry is stored in ARC just like anything else you read from disk.

DDT is metadata, not data, so it is more constrained than data entries in the ARC.

> So the point is - Whenever you do a write, and the calculated DDT is not already in ARC/L2ARC, the system will actually perform several small reads looking for the DDT entry before it finally knows that the DDT entry actually exists. So the penalty of performing a write, with dedup enabled, and the relevant DDT entry not already in ARC/L2ARC is a very large penalty. What originated as a single write quickly became several small reads plus a write, due to the fact the necessary DDT entry was not already available.
>
> The penalty of insufficient ram, in conjunction with dedup, is terrible write performance.
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
OK, I just re-looked at a couple of things, and here's what I /think/ are the correct numbers.

A single entry in the DDT is defined in the struct "ddt_entry":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108
I just checked, and the current size of this structure is 0x178, or 376 bytes.

Each ARC entry, which points to either an L2ARC item (of any kind, cached data, metadata, or a DDT line) or actual data/metadata/etc., is defined in the struct "arc_buf_hdr":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431
Its current size is 0xb0, or 176 bytes.

These are fixed-size structures. PLEASE - someone correct me if these two structures AREN'T what we should be looking at.

So, our estimate calculations have to be based on these new numbers. Back to the original scenario: 1TB (after dedup) of 4k blocks: how much space is needed for the DDT, and how much ARC space is needed if the DDT is kept in a L2ARC cache device?

Step 1) 1TB (2^40 bytes) stored in blocks of 4k (2^12) = 2^28 blocks total, which is about 268 million.

Step 2) 2^28 blocks of information in the DDT requires 376 bytes/block * 2^28 blocks = 94 * 2^30 = 94 GB of space.

Step 3) Storing a reference to 268 million (2^28) DDT entries in the L2ARC will consume the following amount of ARC space: 176 bytes/entry * 2^28 entries = 44GB of RAM.

That's pretty ugly. So, to summarize, for 1TB of data, broken into the following block sizes:

Block size   DDT size          ARC consumption
512b         752GB    (73%)    352GB   (34%)
4k           94GB     (9%)     44GB    (4.3%)
8k           47GB     (4.5%)   22GB    (2.1%)
32k          11.75GB  (2.2%)   5.5GB   (0.5%)
64k          5.9GB    (1.1%)   2.75GB  (0.3%)
128k         2.9GB    (0.6%)   1.4GB   (0.1%)

ARC consumption presumes the whole DDT is stored in the L2ARC. Percentage size is relative to the original 1TB total data size.

Of course, the trickier proposition here is that we DON'T KNOW what our dedup value is ahead of time on a given data set. That is, given a data set of X size, we don't know how big the deduped data size will be. The above calculations are for DDT/ARC size for a data set that has already been deduped down to 1TB in size.

Perhaps it would be nice to have some sort of userland utility that builds its own DDT as a test and does all the above calculations, to see how dedup would work on a given dataset. 'zdb -S' sorta, kinda does that, but...

--
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
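The arithmetic in the table can be redone for any other data size or block size with a couple of lines of shell; this sketch just reuses the structure sizes quoted above (376 bytes per ddt_entry, 176 bytes per arc_buf_hdr) and makes no attempt to predict the dedup ratio:

# Estimate DDT size and the ARC overhead of holding the whole DDT in L2ARC
DATA_BYTES=$((1 << 40))        # 1 TB of already-deduped data
BLOCK=4096                     # average block size in bytes
BLOCKS=$((DATA_BYTES / BLOCK))
echo "blocks:            $BLOCKS"
echo "DDT size (GB):     $((BLOCKS * 376 / 1024 / 1024 / 1024))"
echo "ARC overhead (GB): $((BLOCKS * 176 / 1024 / 1024 / 1024))"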