Given the abysmal performance, I have to assume there are a significant number of "overhead" reads or writes needed to maintain the DDT for each "actual" block write operation. Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes). So I'm forced to conclude that a bunch of small DDT maintenance writes are taking place, each incurring its own access-time penalty on top of the access-time penalty for the intended single block write.
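Just to make that reasoning concrete, here is a trivial back-of-the-envelope model (my own sketch, not anything measured from the ZFS code) of what a few extra random DDT I/Os per data block do to a seek-bound pool of spinning disks. The specific numbers (100 random IOPS per spindle, 8 spindles, 128K records) are assumptions for illustration only:

# Back-of-the-envelope: effective write throughput when every data block
# write drags along some number of small, random DDT maintenance I/Os.
# All figures below are illustrative assumptions, not measurements.

def effective_throughput_mb_s(spindle_iops=100,     # assumed random IOPS per disk
                              spindles=8,            # assumed pool width
                              recordsize_kb=128,     # assumed data block size
                              ddt_ios_per_block=0):  # extra random DDT I/Os per block
    """Crude model: every data block costs 1 + ddt_ios_per_block random I/Os."""
    total_iops = spindle_iops * spindles
    blocks_per_sec = total_iops / (1 + ddt_ios_per_block)
    return blocks_per_sec * recordsize_kb / 1024.0

for overhead in (0, 1, 2, 4, 8):
    print(f"{overhead} DDT I/Os per block -> "
          f"{effective_throughput_mb_s(ddt_ios_per_block=overhead):7.1f} MB/s")

The point being: even one or two extra random I/Os per block cuts the throughput of an already seek-bound pool in half or worse, and it only goes downhill from there.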
The nature of the DDT is that it's a bunch of small blocks that tend to be scattered randomly and that require maintenance before you can do anything else. That is precisely the access pattern that benefits from low-latency devices such as SSDs.

I understand the argument for keeping the DDT in the primary storage pool: you can grow the pool without ever running out of space to hold the DDT. But it's a fatal design flaw as long as you care about performance. And if you don't care about performance, you might as well use the NetApp and do offline dedup. The point of online dedup is to gain performance, so with ZFS dedup you have to care about performance.

There are only two possible ways to fix the problem.

Either ... the DDT is changed so it can be stored in a designated sequential area of disk and maintained entirely in RAM, so all DDT reads/writes can be infrequent and serial in nature. This would solve the case of async writes and large sync writes, but it would still perform poorly for small sync writes, and it would be memory intensive. Within those limitations, though, it should perform very nicely. ;-)

Or ... the DDT stays as it is now, highly scattered small blocks, and there is an option to store it entirely on low-latency devices such as dedicated SSDs, eliminating the need for the DDT to reside on the slow primary storage pool disks. I understand you must consider what happens when the dedicated SSD gets full. The obvious choices would be either (a) dedup turns off whenever the metadata device is full, or (b) DDT writes fall back to the main storage pool. Maybe that could even be a configurable behavior. Either way, there's a very realistic use case here. For some people in some situations, it may be acceptable to say "I have a 32G mirrored metadata device; divided by 137 bytes per entry, I can dedup up to a maximum of roughly 218M unique blocks in the pool, and if I estimate a 100K average block size, that means up to about 20T of primary pool storage. If I reach that limit, I'll add more metadata device." (A rough sizing sketch along those lines is at the end of this mail.)

Both of those options would also go a long way toward eliminating the "surprise" delete performance black hole.
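For what it's worth, here is the sizing arithmetic from the quoted example written out as a throwaway script, so anyone can plug in their own metadata-device size, per-entry size, and average block size. The 137 bytes per DDT entry is the figure quoted above; the script's result comes out somewhat higher than my rough 218M figure, depending on whether you count GB or GiB and how much per-entry overhead you allow, so treat both as order-of-magnitude estimates:

# Rough dedup capacity sizing for a hypothetical dedicated DDT device.
# All inputs are assumptions; replace them with your own numbers.

def ddt_capacity(metadata_dev_bytes,          # usable space on the (mirrored) SSD
                 bytes_per_entry=137,         # approximate DDT entry size (figure quoted above)
                 avg_block_bytes=100 * 1024): # estimated average block size in the pool
    max_entries = metadata_dev_bytes // bytes_per_entry
    max_pool_bytes = max_entries * avg_block_bytes
    return max_entries, max_pool_bytes

entries, pool_bytes = ddt_capacity(32 * 10**9)   # 32 GB usable metadata device
print(f"max unique blocks : {entries / 1e6:8.1f} M")
print(f"max dedup'ed pool : {pool_bytes / 1e12:8.1f} TB")

If you want to be conservative, size against the smallest block size you actually expect rather than the average, since the DDT grows per unique block, not per unique byte.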