Hi Bikas,

I completely agree with you in principle -- short-circuit reads end up ceding control of the data path from the DataNode to the user applications. This has a few disadvantages, which you've mentioned and which have been brought up in the JIRA as well: particularly QoS, metrics, the flexibility to change our on-disk data layout in the future, etc.
However, the performance advantages of this approach are quite stark when the data sets have been cached in the OS buffer cache. For example, using a low-overhead client like Impala executing a simple table-scan query, we've seen a 2x or more improvement in overall response time using short-circuit reads versus localhost TCP. The overhead comes primarily from the kernel layers, not from our own code -- e.g. localhost TCP connections still perform packet segmentation, force multiple buffer copies to and from kernel space, incur several syscalls, etc. A better-implemented DataNode, perhaps transferring data over domain sockets, might close the gap a bit, but based on all of my benchmarks it would still be ~50% slower than short-circuit reads.

If you look at the history of HDFS-347, I actually asked Colin to implement and experiment with a non-short-circuit path over domain sockets, under the assumption that they might be more efficient than loopback TCP sockets. The results weren't particularly encouraging, though that path can still be enabled by anyone who wants to experiment with optimizing it further. There are also some improvements coming down the road in the Linux kernel (in particular "TCP friends") which can eliminate some of the TCP stack overhead for loopback connections, but unfortunately they're several years off for those of us deploying on mainstream distros.

Most of the above is in reference to sequential throughput. Random IO performance is even more drastically affected -- the benchmarks I posted on HDFS-347 show a 3-4x improvement in some workloads when the data is in the buffer cache. As the RAM capacities of our machines continue to increase, and as solid-state storage becomes more cost-effective, more and more random reads fall into the category where they're bound not by the hardware but by our software overhead.

Given all of the above, I think the performance benefits of short-circuit reads outweigh the disadvantages. Since this is entirely an implementation optimization, not an API, we can always re-evaluate in future versions -- if either someone figures out a way to get a non-short-circuit implementation to comparable performance, or the kernel guys catch up and implement TCP friends and other features which close the gap.

Colin has also been careful to build into the API the capability for the DataNode to reject a short-circuit request based on a version number, causing the client to seamlessly fall back to the normal read path. This would allow us to change the underlying format on DataNodes to something which isn't SCR-friendly without causing any incompatibility for existing clients.

Hope the above explains the motivation for the feature.

Thanks,
-Todd
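To make the kernel-overhead comparison above concrete, here is a rough, self-contained sketch -- not HDFS code -- that reads a file already sitting in the OS buffer cache, once directly through a FileChannel and once streamed over a loopback TCP socket. The file size, buffer size, and single-threaded sender are arbitrary choices for illustration; absolute numbers will vary by kernel and hardware.

```java
import java.io.*;
import java.net.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class LoopbackVsLocalRead {
  static final int FILE_MB = 256;   // size of the test "block"
  static final int BUF = 64 * 1024; // read buffer size

  public static void main(String[] args) throws Exception {
    // Create a test file and read it once so it sits in the OS buffer cache.
    Path data = Files.createTempFile("blk", ".dat");
    byte[] chunk = new byte[1024 * 1024];
    try (OutputStream out = Files.newOutputStream(data)) {
      for (int i = 0; i < FILE_MB; i++) out.write(chunk);
    }
    directRead(data); // warm-up pass

    long direct = time(() -> directRead(data));
    long socket = time(() -> socketRead(data));
    System.out.printf("direct file read: %d ms, loopback TCP: %d ms%n",
        direct, socket);
    Files.delete(data);
  }

  interface Body { void run() throws Exception; }

  static long time(Body b) throws Exception {
    long t0 = System.nanoTime();
    b.run();
    return (System.nanoTime() - t0) / 1_000_000;
  }

  // Read the file directly, the way a short-circuit client reads a block file.
  static void directRead(Path p) throws Exception {
    ByteBuffer buf = ByteBuffer.allocateDirect(BUF);
    try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
      while (ch.read(buf) > 0) buf.clear();
    }
  }

  // Stream the same bytes over a loopback TCP connection -- roughly what a
  // localhost DataNode read costs, minus any HDFS protocol overhead.
  static void socketRead(Path p) throws Exception {
    try (ServerSocket server =
             new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      Thread writer = new Thread(() -> {
        try (Socket s = server.accept();
             OutputStream out = s.getOutputStream()) {
          Files.copy(p, out);
        } catch (IOException e) { throw new UncheckedIOException(e); }
      });
      writer.start();
      byte[] buf = new byte[BUF];
      try (Socket s = new Socket(InetAddress.getLoopbackAddress(),
                                 server.getLocalPort());
           InputStream in = s.getInputStream()) {
        while (in.read(buf) > 0) { /* discard */ }
      }
      writer.join();
    }
  }
}
```

Shrinking BUF tends to exaggerate the gap, since per-call syscall and copy overhead then dominates over the actual data movement.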
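The version-based fallback described above can be pictured with a minimal client-side sketch. All of the names below (LocalDataNodeClient, openShortCircuit, ShortCircuitRejectedException, and so on) are hypothetical stand-ins for illustration, not the actual HDFS-347 classes or wire protocol.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch of the fallback pattern: the client asks the co-located
 * DataNode for direct access to a block file, and if the DataNode refuses
 * (e.g. it stores a newer, incompatible on-disk format), the client silently
 * falls back to the normal TCP read path.
 */
public class BlockReaderFactorySketch {

  /** What the client ultimately needs: a stream over the block's bytes. */
  interface BlockReader {
    InputStream stream() throws IOException;
  }

  /** Thrown (hypothetically) when the DataNode declines a short-circuit request. */
  static class ShortCircuitRejectedException extends IOException {
    ShortCircuitRejectedException(String msg) { super(msg); }
  }

  interface LocalDataNodeClient {
    /** Ask the co-located DataNode for direct access to the block file. */
    BlockReader openShortCircuit(long blockId, int clientVersion)
        throws IOException;
  }

  interface RemoteDataNodeClient {
    /** Ordinary read over TCP through the DataNode. */
    BlockReader openRemote(long blockId) throws IOException;
  }

  // Bumped whenever the block layout the client understands changes.
  static final int CLIENT_SC_VERSION = 1;

  static BlockReader open(long blockId,
                          LocalDataNodeClient local,
                          RemoteDataNodeClient remote) throws IOException {
    if (local != null) {
      try {
        // The DataNode compares CLIENT_SC_VERSION against the format it
        // actually stores; on mismatch it rejects rather than handing out
        // a file the client would misread.
        return local.openShortCircuit(blockId, CLIENT_SC_VERSION);
      } catch (ShortCircuitRejectedException e) {
        // Fall through: the client keeps working, just without the fast path.
      }
    }
    return remote.openRemote(blockId);
  }
}
```

The property that matters is that rejection is an ordinary, recoverable outcome: a DataNode with an incompatible layout only costs old clients the optimization, never correctness.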
On Tue, Feb 26, 2013 at 1:47 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> Hi,
>
> In my opinion, this feature of short circuit reads (HDFS-347 or HDFS-2246) is not a desirable feature for HDFS. We should be working towards removing this feature instead of enhancing it and making it popular.
>
> Maybe short-circuit reads were something that HBase needed for performance at a point in time when HDFS performance was slow. But after all the improvements that have been made, is it still unacceptably slow to read from HDFS? Is there more good engineering that we can do to close that gap? Close it for all HDFS users and not just the ones who use short-circuit reads?
>
> Which brings me to the question - who is the target audience for this feature? From what I see, anyone who potentially wants to use it == everyone. Now if everyone starts using short circuit reads, what happens to the performance problem that we are trying to solve? Will performance still be better then? This is especially important in the context of YARN, where we don't control the apps that run on the shared grid.
>
> What problem are we trying to solve here? If we want better HDFS performance and QoS for services, then we want to give as much control over the disk to HDFS rather than take it away. Short circuit reads leave a gaping hole towards that end, and making short circuit reads better and easier to use makes that hole larger.
>
> I am sorry for replying late, and also because my response might be missing historical perspectives that I am not aware of.
>
> Bikas
>
> -----Original Message-----
> From: rarecac...@gmail.com [mailto:rarecac...@gmail.com] On Behalf Of Colin McCabe
> Sent: Sunday, February 17, 2013 1:49 PM
> To: hdfs-dev@hadoop.apache.org
> Subject: VOTE: HDFS-347 merge
>
> Hi all,
>
> I would like to merge the HDFS-347 branch back to trunk. It's been under intensive review and testing for several months. The branch adds a lot of new unit tests, and passes Jenkins as of 2/15 [1].
>
> We have tested HDFS-347 with both random and sequential workloads. The short-circuit case is substantially faster [2], and overall performance looks very good. This is especially encouraging given that the initial goal of this work was to make security compatible with short-circuit local reads, rather than to optimize the short-circuit code path. We've also stress-tested HDFS-347 on a number of clusters.
>
> This initial VOTE is to merge only into trunk. Just as we have done with our other recent merges, we will consider merging into branch-2 after the code has been in trunk for a few weeks.
>
> Please cast your vote by EOD Sunday 2/24.
>
> best,
> Colin McCabe
>
> [1] https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13579704&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13579704
>
> [2] https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13551755&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13551755

--
Todd Lipcon
Software Engineer, Cloudera