I've been thinking about this for a while, but Anton's analysis makes me think about it even more:

We all love ZFS, right? It's futuristic in a bold new way, with many virtues; I won't preach to the choir. But gluing it all together requires some CPU- and memory-intensive operations: checksum generation/validation, compression, encryption, data placement/component load balancing, etc. Processors have gotten really powerful, much more so than the relative disk I/O gains, which in all honesty is what makes ZFS possible.

My question: is anyone working on an offload engine for ZFS? I can envision a highly optimized, pipelined system where writes and reads pass through checksum, compression, and encryption ASICs that also place data properly on disk. This could even take the form of a PCIe SATA/SAS card with many ports, or other options. This would make direct I/O, or DMA I/O, possible again. The file system abstraction in ZFS is really too much and too important to ignore, and too hard to optimize under varying load conditions (my rookie opinion), to expect any RDBMS app to have a clue what to do with it. I guess what I'm saying is that the RDBMS app will know what blocks it needs, and wants to get them in and out speedy quick, but the mapping to disk is not linear with ZFS the way it is with other file systems. An offload engine could do this translation instead.

Just throwing this out there for the purpose of blue sky fluff.

Jon

Anton B. Rang wrote:
5) DMA straight from user buffer to disk avoiding a copy.

This is what the "direct" in "direct i/o" has historically meant.  :-)
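For concreteness, here's a rough sketch of what an application-level request for direct I/O looks like. It assumes the Solaris directio(3C) advisory call (other platforms use an O_DIRECT open flag instead); the log path and the LPS value are made up for the example, and ZFS does not honor this advice today, which is part of what's being debated:

#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LPS 8192  /* assumed log page size; databases vary */

int main(void)
{
    /* hypothetical redo log path, for illustration only */
    int fd = open("/db/redo01.log", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Advise the file system to bypass the page cache for this file.
     * directio(3C) is the Solaris/UFS interface; ZFS ignores it. */
    if (directio(fd, DIRECTIO_ON) != 0)
        perror("directio (advice not honored?)");

    /* Direct I/O typically requires a sector-aligned buffer. */
    void *buf;
    if (posix_memalign(&buf, 512, LPS) != 0) {
        close(fd);
        return 1;
    }
    memset(buf, 0, LPS);

    /* When direct I/O is honored, this write DMAs straight from buf
     * to disk, with no copy into a kernel buffer. */
    if (write(fd, buf, LPS) != (ssize_t)LPS)
        perror("write");

    free(buf);
    close(fd);
    return 0;
}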

My line has been that 5) won't help latency much, and
latency is where I think the game is currently played.  Now the
disconnect might be because people feel that the game
is not latency but CPU efficiency: "how many CPU cycles do I
burn to get data from disk to user buffer?"

Actually, in many cases it's less about CPU cycles than about memory cycles.

For many databases, most of the I/O is writes (reads wind up
cached in memory).  What's the cost of a write?

With direct I/O: CPU writes to memory (spread out over many
transactions), disk DMAs from memory.  We write LPS (log page size)
bytes of data from CPU to memory, we read LPS bytes from memory.
On processors without a cache line zero, we probably read the LPS
data from memory as part of the write.  Total cost = W:LPS, R:2*LPS.

Without direct I/O: The cost of getting the data into the user buffer
remains the same (W:LPS, R:LPS).  We copy the data from user buffer
to system buffer (W:LPS, R:LPS).  Then we push it out to disk.  Total
cost = W:2*LPS, R:3*LPS.  We've nearly doubled the cost, not including
any TLB effects.
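As a sanity check on the arithmetic above, here is a tiny sketch that tallies the memory traffic for one log write under both schemes, using the same W/R accounting; LPS is an assumed 8 KB log page size just for the example:

#include <stdio.h>

#define LPS 8192UL  /* assumed log page size, bytes */

int main(void)
{
    /* Direct I/O: CPU writes the page (W:LPS); write-allocate fills
     * read those cache lines from memory first (R:LPS); the disk
     * then DMAs the page from memory (R:LPS). */
    unsigned long direct_w = LPS;
    unsigned long direct_r = 2 * LPS;

    /* Buffered I/O: same cost to build the page in the user buffer,
     * plus the copy from user buffer to system buffer (W:LPS, R:LPS),
     * plus the DMA read of the system buffer (R:LPS). */
    unsigned long buffered_w = 2 * LPS;
    unsigned long buffered_r = 3 * LPS;

    printf("direct:   W=%lu R=%lu total=%lu\n",
           direct_w, direct_r, direct_w + direct_r);
    printf("buffered: W=%lu R=%lu total=%lu\n",
           buffered_w, buffered_r, buffered_w + buffered_r);
    printf("traffic ratio (buffered/direct) = %.2f\n",
           (double)(buffered_w + buffered_r) /
           (double)(direct_w + direct_r));
    return 0;
}

The ratio works out to 5/3, about 1.67x in total memory traffic, and exactly 2x on the write side, which is where the "nearly a 2x improvement in log write bandwidth" figure below comes from.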

On a memory-bandwidth-starved system (which should be nearly all
modern designs, especially with multi-threaded chips like Niagara),
replacing buffered I/O with direct I/O should give you nearly a 2x
improvement in log write bandwidth.  That's without considering
cache effects (which shouldn't be too significant, really, since LPS
should be << the size of L2).

How significant is this?  We'd have to measure; and it will likely
vary quite a lot depending on which database is used for testing.

But note that, for ZFS, the win with direct I/O will be somewhat
less.  That's because you still need to read the page to compute
its checksum.  So for direct I/O with ZFS (with checksums enabled),
the cost is W:LPS, R:2*LPS.  Is saving one page of writes enough to
make a difference?  Possibly not.
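The extra read comes from the checksum pass itself: to compute the block checksum, ZFS has to walk the page once more. A rough sketch of that kind of pass (a simplified Fletcher-style sum, not the actual ZFS checksum code) shows where the additional R:LPS comes from:

#include <stdint.h>
#include <stddef.h>

#define LPS 8192  /* assumed log page size, bytes */

/* Toy 2-accumulator Fletcher-style checksum, for illustration only;
 * the real ZFS fletcher/SHA-256 implementations differ.  The point
 * is simply that every byte of the page is read one more time. */
void toy_checksum(const void *buf, size_t size, uint64_t out[2])
{
    const uint32_t *p = buf;
    const uint32_t *end = p + size / sizeof(uint32_t);
    uint64_t a = 0, b = 0;

    for (; p < end; p++) {   /* one full read pass over the page */
        a += *p;
        b += a;
    }
    out[0] = a;
    out[1] = b;
}

That read pass is why the R:2*LPS term doesn't shrink for ZFS, and the saving relative to buffered I/O comes down to the one page of writes mentioned above.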

Anton

--


-     _____/     _____/      /           - Jonathan Loran -           -
-    /          /           /                IT Manager               -
-  _____  /   _____  /     /     Space Sciences Laboratory, UC Berkeley
-        /          /     /      (510) 643-5146 [EMAIL PROTECTED]
- ______/    ______/    ______/           AST:7731^29u18e3

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
