Hi Bikas,

I completely agree with you in principle -- short-circuit reads end up ceding control of the data path from the DataNode to the user applications. This has a few disadvantages, which you've mentioned and which have been brought up in the JIRA as well: particularly QoS, metrics, the flexibility to change our on-disk data layout in the future, etc.
However, the performance advantages of this approach are quite stark when the data sets have been cached in the OS buffer cache. For example, using a low-overhead client like Impala executing a simple table-scan query, we've seen a 2x or more improvement in overall response time using short-circuit reads versus localhost TCP. The overhead comes primarily from the kernel layers, not from our own code -- e.g. localhost TCP connections still perform packet segmentation, force multiple buffer copies to and from kernel space, incur several syscalls, etc. A better-implemented DataNode, perhaps transferring data over domain sockets, might close the gap a bit, but based on all of my benchmarks it would still be ~50% slower than short-circuit reads.

If you look at the history of HDFS-347, I actually asked Colin to implement and experiment with a non-short-circuit path over domain sockets, under the assumption that they might be more efficient than loopback TCP sockets. The results weren't particularly encouraging, though that path can still be enabled by anyone who wants to experiment with optimizing it further. There are also some improvements coming down the road in the Linux kernel (in particular "TCP friends") which can eliminate some of the TCP stack overhead for loopback connections, but unfortunately they're several years off for those of us deploying on mainstream distros.

Most of the above is in reference to sequential throughput. Random IO performance is even more drastically affected -- the benchmarks I posted on HDFS-347 show a 3-4x improvement in some workloads when the data is in the buffer cache. As the RAM capacities of our machines continue to increase, and as solid-state storage becomes more cost-effective, more and more random reads fall into the category where they're bound not by the hardware but by our software overhead.

Given all of the above, I think the performance benefits of short-circuit reads outweigh the disadvantages. Since this is entirely an implementation optimization, not an API, we can always re-evaluate in future versions -- if either someone figures out a way to get a non-short-circuit implementation to comparable performance, or the kernel guys catch up and implement TCP friends and other features which close the gap.

Colin has also been careful to build into the API the capability for the DataNode to reject a short-circuit request based on a version number, causing the client to seamlessly fall back to the normal read path. This would allow us to change the underlying format on DataNodes to something which isn't SCR-friendly without causing any incompatibility for existing clients.

Hope the above explains the motivation for the feature.

Thanks,
-Todd
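To make the kernel-overhead comparison above concrete, here is a rough, self-contained sketch -- not HDFS code -- that reads a file already sitting in the OS buffer cache, once directly through a FileChannel and once streamed over a loopback TCP socket. The file size, buffer size, and single-threaded sender are arbitrary choices for illustration; absolute numbers will vary by kernel and hardware.

```java
import java.io.*;
import java.net.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class LoopbackVsLocalRead {
  static final int FILE_MB = 256;   // size of the test "block"
  static final int BUF = 64 * 1024; // read buffer size

  public static void main(String[] args) throws Exception {
    // Create a test file and read it once so it sits in the OS buffer cache.
    Path data = Files.createTempFile("blk", ".dat");
    byte[] chunk = new byte[1024 * 1024];
    try (OutputStream out = Files.newOutputStream(data)) {
      for (int i = 0; i < FILE_MB; i++) out.write(chunk);
    }
    directRead(data); // warm-up pass

    long direct = time(() -> directRead(data));
    long socket = time(() -> socketRead(data));
    System.out.printf("direct file read: %d ms, loopback TCP: %d ms%n",
        direct, socket);
    Files.delete(data);
  }

  interface Body { void run() throws Exception; }

  static long time(Body b) throws Exception {
    long t0 = System.nanoTime();
    b.run();
    return (System.nanoTime() - t0) / 1_000_000;
  }

  // Read the file directly, the way a short-circuit client reads a block file.
  static void directRead(Path p) throws Exception {
    ByteBuffer buf = ByteBuffer.allocateDirect(BUF);
    try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
      while (ch.read(buf) > 0) buf.clear();
    }
  }

  // Stream the same bytes over a loopback TCP connection -- roughly what a
  // localhost DataNode read costs, minus any HDFS protocol overhead.
  static void socketRead(Path p) throws Exception {
    try (ServerSocket server =
             new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      Thread writer = new Thread(() -> {
        try (Socket s = server.accept();
             OutputStream out = s.getOutputStream()) {
          Files.copy(p, out);
        } catch (IOException e) { throw new UncheckedIOException(e); }
      });
      writer.start();
      byte[] buf = new byte[BUF];
      try (Socket s = new Socket(InetAddress.getLoopbackAddress(),
                                 server.getLocalPort());
           InputStream in = s.getInputStream()) {
        while (in.read(buf) > 0) { /* discard */ }
      }
      writer.join();
    }
  }
}
```

Shrinking BUF tends to exaggerate the gap, since per-call syscall and copy overhead then dominates over the actual data movement.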
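The version-based fallback described above can be pictured with a minimal client-side sketch. All of the names below (LocalDataNodeClient, openShortCircuit, ShortCircuitRejectedException, and so on) are hypothetical stand-ins for illustration, not the actual HDFS-347 classes or wire protocol.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch of the fallback pattern: the client asks the co-located
 * DataNode for direct access to a block file, and if the DataNode refuses
 * (e.g. it stores a newer, incompatible on-disk format), the client silently
 * falls back to the normal TCP read path.
 */
public class BlockReaderFactorySketch {

  /** What the client ultimately needs: a stream over the block's bytes. */
  interface BlockReader {
    InputStream stream() throws IOException;
  }

  /** Thrown (hypothetically) when the DataNode declines a short-circuit request. */
  static class ShortCircuitRejectedException extends IOException {
    ShortCircuitRejectedException(String msg) { super(msg); }
  }

  interface LocalDataNodeClient {
    /** Ask the co-located DataNode for direct access to the block file. */
    BlockReader openShortCircuit(long blockId, int clientVersion)
        throws IOException;
  }

  interface RemoteDataNodeClient {
    /** Ordinary read over TCP through the DataNode. */
    BlockReader openRemote(long blockId) throws IOException;
  }

  // Bumped whenever the block layout the client understands changes.
  static final int CLIENT_SC_VERSION = 1;

  static BlockReader open(long blockId,
                          LocalDataNodeClient local,
                          RemoteDataNodeClient remote) throws IOException {
    if (local != null) {
      try {
        // The DataNode compares CLIENT_SC_VERSION against the format it
        // actually stores; on mismatch it rejects rather than handing out
        // a file the client would misread.
        return local.openShortCircuit(blockId, CLIENT_SC_VERSION);
      } catch (ShortCircuitRejectedException e) {
        // Fall through: the client keeps working, just without the fast path.
      }
    }
    return remote.openRemote(blockId);
  }
}
```

The property that matters is that rejection is an ordinary, recoverable outcome: a DataNode with an incompatible layout only costs old clients the optimization, never correctness.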
On Tue, Feb 26, 2013 at 1:47 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> Hi,
>
> In my opinion, this feature of short circuit reads (HDFS-347 or HDFS-2246) is not a desirable feature for HDFS. We should be working towards removing this feature instead of enhancing it and making it popular.
>
> Maybe short-circuit reads were something that HBase needed for performance at a point in time when HDFS performance was slow. But after all the improvements that have been made, is it still unacceptably slow to read from HDFS? Is there more good engineering that we can do to close that gap? Close it for all HDFS users and not just the ones who use short-circuit reads?
>
> Which brings me to the question - who is the target audience for this feature? From what I see, anyone who potentially wants to use it == everyone. Now if everyone starts using short circuit reads, what happens to the performance problem that we are trying to solve? Will performance still be better then? This is especially important in the context of YARN, where we don't control the apps that run on the shared grid.
>
> What problem are we trying to solve here? If we want better HDFS performance and QoS for services, then we want to give as much control over the disk to HDFS rather than take it away. Short circuit reads leave a gaping hole towards that end, and making short circuit reads better and easier to use makes that hole larger.
>
> I am sorry for replying late, and also because my response might be missing historical perspectives that I am not aware of.
>
> Bikas
>
> -----Original Message-----
> From: rarecac...@gmail.com [mailto:rarecac...@gmail.com] On Behalf Of Colin McCabe
> Sent: Sunday, February 17, 2013 1:49 PM
> To: hdfs-dev@hadoop.apache.org
> Subject: VOTE: HDFS-347 merge
>
> Hi all,
>
> I would like to merge the HDFS-347 branch back to trunk. It's been under intensive review and testing for several months. The branch adds a lot of new unit tests, and passes Jenkins as of 2/15 [1].
>
> We have tested HDFS-347 with both random and sequential workloads. The short-circuit case is substantially faster [2], and overall performance looks very good. This is especially encouraging given that the initial goal of this work was to make security compatible with short-circuit local reads, rather than to optimize the short-circuit code path. We've also stress-tested HDFS-347 on a number of clusters.
>
> This initial VOTE is to merge only into trunk. Just as we have done with our other recent merges, we will consider merging into branch-2 after the code has been in trunk for a few weeks.
>
> Please cast your vote by EOD Sunday 2/24.
>
> best,
> Colin McCabe
>
> [1] https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13579704&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13579704
>
> [2] https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13551755&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13551755

--
Todd Lipcon
Software Engineer, Cloudera