Where is the JIRA for this?

Sent from my iPhone
> On Nov 18, 2013, at 1:25 PM, Andrew Wang <andrew.w...@cloudera.com> wrote:
>
> Thanks for asking, here's a link:
>
> http://www.umbrant.com/papers/socc12-cake.pdf
>
> I don't think there's a recording of my talk, unfortunately.
>
> I'll also copy my comments over to the JIRA, though I'd like to not
> distract too much from what Lohit's trying to do.
>
> On Wed, Nov 13, 2013 at 2:54 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>> This is interesting - I've moved my comments over to the JIRA, and it
>> would be good for yours to go there too.
>>
>> Is there a URL for your paper?
>>
>> On 13 November 2013 06:27, Andrew Wang <andrew.w...@cloudera.com> wrote:
>>
>>> Hey Steve,
>>>
>>> My research project (Cake, published at SoCC '12) was trying to provide
>>> SLAs for mixed workloads of latency-sensitive and throughput-bound
>>> applications, e.g. HBase running alongside MR. This was challenging
>>> because seeks are a real killer. Basically, we had to strictly limit MR
>>> I/O to keep worst-case seek latency down, and did so by putting
>>> schedulers on the RPC queues in HBase and HDFS to restrict queuing in
>>> the OS and disk, where we lacked preemption.
>>>
>>> Regarding citations of note, most academics consider throughput sharing
>>> to be a solved problem. It's not dissimilar from normal time slicing:
>>> you try to ensure fairness over some coarse timescale. I think cgroups
>>> [1] and ioprio_set [2] essentially provide this.
>>>
>>> Mixing throughput and latency, though, is difficult, and my conclusion
>>> is that there isn't a really great solution for spinning disks besides
>>> physical isolation. As we all know, you can get either IOPS or
>>> bandwidth, but not both, and it's not a linear tradeoff between the
>>> two. If you're interested in this though, I can dig up some related
>>> work from my Cake paper.
>>>
>>> However, since it seems that we're more concerned with throughput-bound
>>> apps, we might be okay just using cgroups and ioprio_set to do
>>> time-slicing. I actually hacked up some code a while ago which passed a
>>> client-provided priority byte to the DN, which used it to set the I/O
>>> priority of the handling DataXceiver accordingly. This isn't the most
>>> outlandish idea, since we've put QoS fields in our RPC protocol, for
>>> instance; this would just be another byte. Short-circuit reads are
>>> outside this paradigm, but then you can use cgroup controls instead.
>>>
>>> My casual conversations with Googlers indicate that there isn't any
>>> special Borg/Omega sauce either, just that they heavily prioritize DFS
>>> I/O over non-DFS. Maybe that's another approach: if we can separate
>>> block management in HDFS, MR tasks could just write their output to a
>>> raw HDFS block, thus bringing a lot of I/O back into the fold of
>>> "datanode as I/O manager" for a machine.
>>>
>>> Overall, I strongly agree with you that it's important to first define
>>> what our goals are regarding I/O QoS. The general case is a tarpit, so
>>> it'd be good to carve off useful things that can be done now (like
>>> Lohit's direction of per-stream/FS throughput throttling with trusted
>>> clients) and then carefully grow the scope as we find more use cases we
>>> can confidently solve.
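>>>
>>> To make the earlier priority-byte idea concrete, here's a rough sketch
>>> of what the DN-side hook could look like. Illustrative only: it assumes
>>> JNA 5.x, uses the x86_64 syscall number for ioprio_set(2) (glibc has no
>>> wrapper, hence the raw syscall), and the class and method names are
>>> hypothetical, not anything in the tree.
>>>
>>> import com.sun.jna.Library;
>>> import com.sun.jna.Native;
>>>
>>> /** Sketch: set the calling thread's I/O priority from a client byte. */
>>> public class IoPrio {
>>>   private interface CLib extends Library {
>>>     CLib INSTANCE = Native.load("c", CLib.class);
>>>     int syscall(int number, Object... args); // variadic syscall(2)
>>>   }
>>>
>>>   private static final int NR_IOPRIO_SET = 251;     // x86_64 only
>>>   private static final int IOPRIO_WHO_PROCESS = 1;
>>>   private static final int IOPRIO_CLASS_SHIFT = 13; // include/linux/ioprio.h
>>>   private static final int IOPRIO_CLASS_BE = 2;     // best-effort, levels 0-7
>>>
>>>   /** Map a client-provided priority byte (0 = highest) to a BE level. */
>>>   public static void setThreadIoPriority(byte clientPriority) {
>>>     int level = Math.min(Math.max(clientPriority, 0), 7);
>>>     int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
>>>     // who == 0 targets the calling thread, e.g. the handling DataXceiver
>>>     if (CLib.INSTANCE.syscall(NR_IOPRIO_SET, IOPRIO_WHO_PROCESS, 0, ioprio) != 0) {
>>>       throw new RuntimeException("ioprio_set failed");
>>>     }
>>>   }
>>> }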
>>>
>>> Best,
>>> Andrew
>>>
>>> [1] cgroups blkio controller
>>> https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
>>> [2] ioprio_set http://man7.org/linux/man-pages/man2/ioprio_set.2.html
>>>
>>> On Tue, Nov 12, 2013 at 1:38 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>>>
>>>> I've looked at it a bit within the context of YARN.
>>>>
>>>> YARN containers are where this would be ideal, as then you'd be able
>>>> to request IO capacity as well as CPU and RAM. For that to work, the
>>>> throttling would have to be outside the app, as you are trying to
>>>> limit code whether or not it wants to be limited, and because you
>>>> probably (*) want to give it more bandwidth if the system is otherwise
>>>> idle. Self-throttling doesn't pick up spare IO.
>>>>
>>>> 1. You can use cgroups in YARN to throttle local disk IO through the
>>>> file:// URLs or the Java filesystem APIs -such as for MR temp data
>>>> (see the sketch in the P.S. below).
>>>> 2. You can't cgroup-throttle HDFS per YARN container, which would be
>>>> the ideal use case for it. The IO is taking place in the DN, and
>>>> cgroups only limit IO in the throttled process group.
>>>> 3. Implementing it in the DN would require a lot more complex code
>>>> there to prioritise work based on block ID (the sole identifier that
>>>> goes around everywhere) or input source (local sockets for HBase IO
>>>> vs the TCP stack).
>>>> 4. Once you go to a heterogeneous filesystem you need to think about
>>>> IO load per storage layer as well as/alongside per-volume.
>>>> 5. There's also a generic RPC request throttle to prevent DoS against
>>>> the NN and other HDFS services. That would need to be server side,
>>>> but once implemented in the RPC code it would be universal.
>>>>
>>>> You also need to define what load you are trying to throttle: pure
>>>> RPCs/second, read bandwidth, write bandwidth, seeks, or IOPS. Once a
>>>> file is lined up for sequential reading, you'd almost want it to
>>>> stream through the next blocks until a high-priority request came
>>>> through, but operations like a seek which would involve a disk head
>>>> movement backwards would be something to throttle (hence you need to
>>>> be storage-type aware, as SSD seeks cost less). You also need to
>>>> consider that although the cost of writes is high, it's usually being
>>>> done with the goal of preserving data -and you don't want to impact
>>>> durability.
>>>>
>>>> (*) probably, because that's one of the issues that causes debates in
>>>> other datacentre platforms, such as Google Omega: do you want max
>>>> cluster utilisation vs max determinism of workload.
>>>>
>>>> If someone were to do IOP throttling in the 3.x+ timeline:
>>>>
>>>> 1. It needs clear use cases, YARN containers being #1 for me.
>>>> 2. We'd have to look at all the research done on this in the past to
>>>> see what works and what doesn't.
>>>>
>>>> Andrew, what citations of relevance do you have?
>>>>
>>>> -steve
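>>>>
>>>> P.S. As a rough illustration of point 1, something like the following
>>>> could cap a container's local-disk read bandwidth via the cgroup-v1
>>>> blkio controller. Assumptions: cgroup v1 mounted at
>>>> /sys/fs/cgroup/blkio, root privileges, and the disk's major:minor
>>>> numbers known; the class and method names are made up for the sketch.
>>>>
>>>> import java.io.IOException;
>>>> import java.nio.file.Files;
>>>> import java.nio.file.Path;
>>>> import java.nio.file.Paths;
>>>>
>>>> /** Sketch: cap a container's read bandwidth on one block device. */
>>>> public class BlkioThrottle {
>>>>   public static void capReadBps(String container, long pid,
>>>>                                 String majMin, long bytesPerSec)
>>>>       throws IOException {
>>>>     Path cg = Paths.get("/sys/fs/cgroup/blkio", container);
>>>>     Files.createDirectories(cg);
>>>>     // e.g. "8:0 10485760" caps reads on /dev/sda at 10 MB/s
>>>>     Files.write(cg.resolve("blkio.throttle.read_bps_device"),
>>>>                 (majMin + " " + bytesPerSec + "\n").getBytes());
>>>>     // move the container's process into the cgroup
>>>>     Files.write(cg.resolve("cgroup.procs"), (pid + "\n").getBytes());
>>>>   }
>>>> }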
>>>>
>>>> On 12 November 2013 04:24, lohit <lohit.vijayar...@gmail.com> wrote:
>>>>
>>>>> 2013/11/11 Andrew Wang <andrew.w...@cloudera.com>
>>>>>
>>>>>> Hey Lohit,
>>>>>>
>>>>>> This is an interesting topic, and something I actually worked on in
>>>>>> grad school before coming to Cloudera. It'd help if you could
>>>>>> outline some of your use cases and how per-FileSystem throttling
>>>>>> would help. For what I was doing, it made more sense to throttle on
>>>>>> the DN side, since you have a better view over all the I/O happening
>>>>>> on the system, and you have knowledge of different volumes so you
>>>>>> can set limits per-disk. This still isn't 100% reliable, though,
>>>>>> since normally a portion of each disk is used for MR scratch space,
>>>>>> which the DN doesn't have control over. I tried playing with thread
>>>>>> I/O priorities here, but didn't see much improvement. Maybe the
>>>>>> newer cgroups stuff can help out.
>>>>>
>>>>> Thanks. Yes, we also thought about having something on the DataNode.
>>>>> This would also mean one could easily throttle clients who access the
>>>>> cluster from outside, for example distcp or hftp copies. Clients need
>>>>> not worry about throttle configs, and each cluster can control how
>>>>> much throughput can be achieved. We do want to have something like
>>>>> this.
>>>>>
>>>>>> I'm sure per-FileSystem throttling will have some benefits (and
>>>>>> probably be easier than some DN-side implementation), but again,
>>>>>> it'd help to better understand the problem you are trying to solve.
>>>>>
>>>>> One idea was flexibility for clients to override and set their own
>>>>> value. On a trusted cluster we could allow clients to go beyond the
>>>>> default value for some use cases. Alternatively, we also thought
>>>>> about having a default value and a max value, where clients could
>>>>> change the default but not go beyond the max. Another problem with
>>>>> DN-side config is having different values for different clients and
>>>>> easily changing those for selective clients.
>>>>>
>>>>> As Haosong also suggested, we could wrap
>>>>> FSDataOutputStream/FSDataInputStream with a ThrottledInputStream. But
>>>>> we might have to be careful about any code which uses FileSystem APIs
>>>>> accidentally throttling itself (like reducer copy, distributed cache
>>>>> and such...).
>>>>>
>>>>>> Best,
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, Nov 11, 2013 at 6:16 PM, Haosong Huang <haosd...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, lohit. There is a class named ThrottledInputStream
>>>>>>> <http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java>
>>>>>>> in hadoop-distcp; you could check it out and find more details.
>>>>>>>
>>>>>>> In addition to this, I am working on resource control (including
>>>>>>> CPU, network, and disk IO) in the JVM. But my implementation
>>>>>>> depends on cgroups, which only runs on Linux. I will push my
>>>>>>> library (java-cgroup) to GitHub in the next several months. If you
>>>>>>> are interested in it, please give me any advice and help me improve
>>>>>>> it. :-)
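>>>>>>>
>>>>>>> For example, wrapping an HDFS read with it could look roughly like
>>>>>>> this. The (InputStream, long maxBytesPerSec) constructor is an
>>>>>>> assumption on my part; check the linked source for the exact
>>>>>>> signature.
>>>>>>>
>>>>>>> import java.io.InputStream;
>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>> import org.apache.hadoop.tools.util.ThrottledInputStream;
>>>>>>>
>>>>>>> public class ThrottledReadExample {
>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>>>>>     // cap this one stream at roughly 20 MB/s
>>>>>>>     InputStream in = new ThrottledInputStream(
>>>>>>>         fs.open(new Path("/path/to/file.txt")), 20L * 1024 * 1024);
>>>>>>>     byte[] buf = new byte[8192];
>>>>>>>     while (in.read(buf) != -1) {
>>>>>>>       // consume at a capped rate
>>>>>>>     }
>>>>>>>     in.close();
>>>>>>>   }
>>>>>>> }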
>>>>>>>
>>>>>>> On Tue, Nov 12, 2013 at 3:47 AM, lohit <lohit.vijayar...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> Thanks for the reply. The changes I was referring to were in the
>>>>>>>> FileSystem.java layer, which should not affect HDFS
>>>>>>>> Replication/NameNode operations. To give a better idea, this would
>>>>>>>> affect clients something like this:
>>>>>>>>
>>>>>>>> Configuration conf = new Configuration();
>>>>>>>> conf.setInt("read.bandwidth.mbpersec", 20); // 20 MB/s
>>>>>>>> FileSystem fs = FileSystem.get(conf);
>>>>>>>>
>>>>>>>> FSDataInputStream fis = fs.open(new Path("/path/to/file.txt"));
>>>>>>>> fis.read(); // <-- reads on this stream capped at 20 MB/s
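>>>>>>>>
>>>>>>>> Under the hood, the wrapper enforcing that config could be a
>>>>>>>> simple average-rate limiter. A minimal sketch (the class name and
>>>>>>>> details are just illustration, not an actual patch):
>>>>>>>>
>>>>>>>> import java.io.FilterInputStream;
>>>>>>>> import java.io.IOException;
>>>>>>>> import java.io.InputStream;
>>>>>>>>
>>>>>>>> /** Illustrative only: caps the average read rate of a stream. */
>>>>>>>> public class RateLimitedInputStream extends FilterInputStream {
>>>>>>>>   private final long bytesPerSec;
>>>>>>>>   private final long startNanos = System.nanoTime();
>>>>>>>>   private long totalBytes = 0;
>>>>>>>>
>>>>>>>>   public RateLimitedInputStream(InputStream in, long bytesPerSec) {
>>>>>>>>     super(in);
>>>>>>>>     this.bytesPerSec = bytesPerSec;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public int read(byte[] b, int off, int len) throws IOException {
>>>>>>>>     throttle();
>>>>>>>>     int n = in.read(b, off, len);
>>>>>>>>     if (n > 0) totalBytes += n;
>>>>>>>>     return n;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public int read() throws IOException {
>>>>>>>>     byte[] one = new byte[1];
>>>>>>>>     return read(one, 0, 1) == -1 ? -1 : one[0] & 0xff;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // Sleep until the average rate since open falls under the cap.
>>>>>>>>   private void throttle() throws IOException {
>>>>>>>>     double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
>>>>>>>>     while (totalBytes > bytesPerSec * elapsedSec) {
>>>>>>>>       try {
>>>>>>>>         Thread.sleep(10);
>>>>>>>>       } catch (InterruptedException e) {
>>>>>>>>         Thread.currentThread().interrupt();
>>>>>>>>         throw new IOException("interrupted while throttling", e);
>>>>>>>>       }
>>>>>>>>       elapsedSec = (System.nanoTime() - startNanos) / 1e9;
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }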
>>>>>>>>
>>>>>>>> 2013/11/11 Adam Muise <amu...@hortonworks.com>
>>>>>>>>
>>>>>>>>> See https://issues.apache.org/jira/browse/HDFS-3475
>>>>>>>>>
>>>>>>>>> Please note that this has met with many unexpected impacts on
>>>>>>>>> workloads. Be careful, and be mindful of your DataNode memory and
>>>>>>>>> network capacity.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 11, 2013 at 1:59 PM, lohit <lohit.vijayar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Devs,
>>>>>>>>>>
>>>>>>>>>> Wanted to reach out and see if anyone has thought about the
>>>>>>>>>> ability to throttle data transfer within HDFS. One option we
>>>>>>>>>> have been thinking of is to throttle on a per-FileSystem basis,
>>>>>>>>>> similar to Statistics in FileSystem. This would mean anyone with
>>>>>>>>>> a handle to HDFS/Hftp would be throttled globally within the
>>>>>>>>>> JVM. The right value for this would be based on the type of
>>>>>>>>>> hardware we use and how many tasks/clients we allow.
>>>>>>>>>>
>>>>>>>>>> On the other hand, doing something like this at the FileSystem
>>>>>>>>>> layer would mean many other tasks, such as job jar copy,
>>>>>>>>>> DistributedCache copy, and any hidden data movement, would also
>>>>>>>>>> be throttled. We wanted to know if anyone has had such a
>>>>>>>>>> requirement on their clusters in the past and what the thinking
>>>>>>>>>> around it was. Appreciate your inputs/comments.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Have a Nice Day!
>>>>>>>>>> Lohit
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adam Muise
>>>>>>>>> Solutions Engineer
>>>>>>>>> Hortonworks
>>>>>>>>
>>>>>>>> --
>>>>>>>> Have a Nice Day!
>>>>>>>> Lohit
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Haosdent Huang
>>>>>
>>>>> --
>>>>> Have a Nice Day!
>>>>> Lohit