Thanks for your response, Richard.

On Fri, Dec 30, 2011 at 09:52:17AM -0800, Richard Elling wrote:
> On Dec 29, 2011, at 10:31 PM, Ray Van Dolson wrote:
> 
> > Hi all;
> > 
> > We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB
> > (we don't run dedupe on production boxes -- and we do pay for Nexenta
> > licenses on prd as well) RAM and an 8.5TB pool with deduplication
> > enabled (1.9TB or so in use). Dedupe ratio is only 1.26x.
> 
> Yes, this workload is a poor fit for dedup.
> 
> > The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.
> > 
> > The box has been performing fairly poorly lately, and we're thinking
> > it's due to deduplication:
> > 
> > # echo "::arc" | mdb -k | grep arc_meta
> > arc_meta_used  = 5884 MB
> > arc_meta_limit = 5885 MB
> 
> This can be tuned. Since you are on the community edition and thus have no
> expectation of support, you can increase this limit yourself. In the future,
> the limit will be increased OOB. For now, add something like the following
> to the /etc/system file and reboot.
> 
> *** Parameter: zfs:zfs_arc_meta_limit
> ** Description: sets the maximum size of metadata stored in the ARC.
> **   Metadata competes with real data for ARC space.
> ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
> ** Validation: none
> ** When to change: for metadata-intensive or deduplication workloads
> **   having more metadata in the ARC can improve performance.
> ** Stability: NexentaStor issue #7151 seeks to change the default
> **   value to be larger than 1/4 of arc_max.
> ** Data type: integer
> ** Default: 1/4 of arc_max (bytes)
> ** Range: 10000 to arc_max
> ** Changed by: YOUR_NAME_HERE
> ** Change date: TODAYS_DATE
> **
> *set zfs:zfs_arc_meta_limit = 10000000
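For reference, here's roughly what I'm thinking of putting in /etc/system on
this box -- the value below is just a ballpark ~10GB pick for our 24GB system,
not a recommendation:

  set zfs:zfs_arc_meta_limit = 10737418240

And after the reboot I'd confirm it took effect with the same check as above:

  # echo "::arc" | mdb -k | grep arc_meta_limit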
If we wanted to do this on a running system, would the following work?

# echo "arc_meta_limit/Z 0x271000000" | mdb -kw

(To up arc_meta_limit to 10GB)

> > arc_meta_max   = 5888 MB
> > 
> > # zpool status -D
> > ...
> > DDT entries 24529444, size 331 on disk, 185 in core
> > 
> > So, not only are we using up all of our metadata cache, but the DDT
> > table is taking up a pretty significant chunk of that (over 70%).
> > 
> > ARC sizing is as follows:
> > 
> > p     = 15331 MB
> > c     = 16354 MB
> > c_min = 2942 MB
> > c_max = 23542 MB
> > size  = 16353 MB
> > 
> > I'm not really sure how to determine how many blocks are on this zpool
> > (is it the same as the # of DDT entries? -- deduplication has been on
> > since pool creation). If I use a 64KB block size average, I get about
> > 31 million blocks, but DDT entries are 24 million ...
> 
> The zpool status -D output shows the number of blocks.
> 
> > zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
> > I/O error). Probably because the pool is in use and is quite busy.
> 
> Yes, zdb is not expected to produce correct output for imported pools.
> 
> > Without the block count I'm having a hard time determining how much
> > memory we _should_ have. I can only speculate that it's "more" at this
> > point. :)
> > 
> > If I assume 24 million blocks is about accurate (from zpool status -D
> > output above), then at 320 bytes per block we're looking at about 7.1GB
> > for DDT table size.
> 
> That is the on-disk calculation. Use the in-core number for memory
> consumption.
> RAM needed if DDT is completely in ARC = 4,537,947,140 bytes (+)
> 
> > We do have L2ARC, though I'm not sure how ZFS
> > decides what portion of the DDT stays in memory and what can go to
> > L2ARC -- if all of it went to L2ARC, then the references to this
> > information in arc_meta would be (at 176 bytes * 24million blocks)
> > around 4GB -- which again is a good chunk of arc_meta_max.
> 
> Some of the data might already be in L2ARC. But L2ARC access is always
> slower than RAM access by a few orders of magnitude.
> 
> > Given that our dedupe ratio on this pool is fairly low anyways, am
> > looking for strategies to back out. Should we just disable
> > deduplication and then maybe bump up the size of the arc_meta_max?
> > Maybe also increase the size of arc.size as well (8GB left for the
> > system seems higher than we need)?
> 
> The arc_size is dynamic, but limited by another bug in Solaris to
> effectively 7/8 of RAM (fixed in illumos). Since you are unsupported, you
> can try to add the following to /etc/system along with the tunable above.
> 
> *** Parameter: swapfs_minfree
> ** Description: sets the minimum space reserved for the rest of the
> **   system as swapfs grows. This value is also used to calculate the
> **   dynamic upper limit of the ARC size.
> ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
> ** Validation: none
> ** When to change: the default setting of physmem/8 caps the ARC to
> **   approximately 7/8 of physmem, a value usually much smaller than
> **   arc_max. Choosing a lower limit for swapfs_minfree can allow the
> **   ARC to grow above 7/8 of physmem.
> ** Data type: unsigned integer (pages)
> ** Default: 1/8 of physmem
> ** Range: clamped at 256MB (65,536 4KB pages) for NexentaStor 4.0
> ** Changed by: YOUR_NAME_HERE
> ** Change date: TODAYS_DATE
> **
> *set swapfs_minfree=65536
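In case it's useful to anyone else following the thread, the sanity check I'm
planning around the swapfs_minfree change (as I understand it -- corrections
welcome) is just to note physical memory, the current swapfs_minfree value,
and the ARC numbers before the reboot and compare afterwards:

  # prtconf | grep -i memory
  # echo "swapfs_minfree/E" | mdb -k
  # echo "::arc" | mdb -k | egrep "size|c_max"

If the tunable does what's described above, the ARC "size" should be able to
climb closer to c_max instead of topping out around 7/8 of RAM.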
> > define "disruptive" > > > zfs send/recv and then back perhaps (we have the extra > > space)? > > send/receive is the most cost-effective way. > -- richard I think we will give this method a shot. Thanks, Ray _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss