Given the abysmal performance, I have to assume there is a significant
number of "overhead" reads or writes to maintain the DDT for each "actual"
block write.  Something I didn't mention in the other email is that I also
tracked iostat throughout the whole operation, and it's essentially all
writes (99.9% or better).  So I'm forced to conclude it's a bunch of small
DDT maintenance writes taking place, each incurring its own access-time
penalty on top of the access-time penalty for the intended block write
itself.

The nature of the DDT is that it's a bunch of small blocks, scattered more
or less randomly, that have to be read and updated before anything else can
happen.  That is precisely the access pattern that benefits from low-latency
devices such as SSDs.
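
To put rough numbers on that, here is a back-of-envelope sketch (not a
measurement; the per-block DDT I/O count, block size, and device IOPS below
are all assumptions):

# Back-of-envelope model: every deduped block written also costs a few
# small, randomly placed DDT reads/writes, each paying a full access-time
# penalty.  All figures here are assumptions for illustration only.

DDT_IOS_PER_BLOCK = 3          # assumed extra DDT I/Os per block written
BLOCK_SIZE = 128 * 1024        # assumed 128K recordsize

def effective_write_mb_per_s(random_iops):
    # Each logical block costs 1 data write plus DDT_IOS_PER_BLOCK random I/Os.
    blocks_per_s = random_iops / float(1 + DDT_IOS_PER_BLOCK)
    return blocks_per_s * BLOCK_SIZE / 1e6

print("7200rpm disk (~100 random IOPS): %6.1f MB/s" % effective_write_mb_per_s(100))
print("SSD (~20000 random IOPS):        %6.1f MB/s" % effective_write_mb_per_s(20000))

Even with generous assumptions the seek-bound case collapses to a few MB/s
(the SSD figure ignores bandwidth limits), which is consistent with the
abysmal behavior described above.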

I understand the argument: the DDT must be stored in the primary storage
pool so you can grow the pool without ever running out of space to hold the
DDT...  But that's a fatal design flaw as long as you care about
performance.  If you don't care about performance, you might as well use the
NetApp and do offline dedup.  The whole point of online dedup is to gain
performance, so with ZFS dedup you have to care about performance.

There are only two possible ways to fix the problem.
Either ...
The DDT is changed so it can be stored entirely in a designated sequential
area of disk and maintained entirely in RAM, so that all DDT reads/writes
are infrequent and sequential...  This would solve the case of async writes
and large sync writes, but would still perform poorly for small sync writes.
And it would be memory intensive.  But it should perform very nicely within
those limitations.  ;-)
Or ...
The DDT stays as it is now (highly scattered small blocks), but there is an
option to store it entirely on low-latency devices such as dedicated SSDs,
eliminating the need for the DDT to reside on the slow primary storage pool
disks.  I understand you must consider what happens when the dedicated SSD
fills up.  The obvious choices would be either (a) dedup turns off whenever
the metadatadevice is full, or (b) new DDT entries spill over onto the main
storage pool.  Maybe that could even be a configurable behavior.  Either
way, there's a very realistic use case here.  For some people in some
situations, it may be acceptable to say "I have a 32G mirrored
metadatadevice; at 137 bytes per entry I can dedup up to a maximum of
roughly 218M unique blocks in the pool, and if I estimate a 100K average
block size, that means up to about 20T of primary pool storage.  If I reach
that limit, I'll add more metadatadevice."
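
Spelled out, that sizing arithmetic looks like this (order-of-magnitude
only; the 137-byte entry size and 100K average block size are just the
estimates above, and real DDT entry sizes vary):

# Rough capacity planning for a dedicated DDT device.  The per-entry size
# and average block size are the estimates quoted above, not authoritative
# figures.

metadata_device_bytes = 32 * 10**9     # usable size of the mirrored 32G device
ddt_entry_bytes       = 137            # assumed on-device bytes per DDT entry
avg_block_bytes       = 100 * 1024     # estimated average block size

max_unique_blocks = metadata_device_bytes // ddt_entry_bytes
max_pool_bytes    = max_unique_blocks * avg_block_bytes

print("max unique blocks: ~%dM" % (max_unique_blocks // 10**6))
print("max dedup'd pool:  ~%dT" % (max_pool_bytes // 10**12))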

Both of those options would also go a long way toward eliminating the
"surprise" delete performance black hole, since freeing deduped data means a
DDT reference-count update for every single block released.
