Re: Configuring flume for better throughput

Pankaj Gupta Fri, 26 Jul 2013 14:13:42 -0700

Here is the flume config of the collector machine. The File channel is
drained by 4 flume sinks, each of which sends messages to a separate
hdfs-writer machine.



agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = /flume1/checkpoint
agent1.channels.ch1.dataDirs = /flume1/data
agent1.channels.ch1.maxFileSize = 375809638400
agent1.channels.ch1.capacity = 75000000
agent1.channels.ch1.transactionCapacity = 4000

agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 4545
agent1.sources.avroSource1.threads = 16

agent1.sinks.avroSink1-1.type = avro
agent1.sinks.avroSink1-1.channel = ch1
agent1.sinks.avroSink1-1.hostname = hdfs-writer-machine-a.mydomain.com
agent1.sinks.avroSink1-1.port = 4545
agent1.sinks.avroSink1-1.connect-timeout = 300000
agent1.sinks.avroSink1-1.batch-size = 4000

agent1.sinks.avroSink1-2.type = avro
agent1.sinks.avroSink1-2.channel = ch1
agent1.sinks.avroSink1-2.hostname = hdfs-writer-machine-b.mydomain.com
agent1.sinks.avroSink1-2.port = 4545
agent1.sinks.avroSink1-2.connect-timeout = 300000
agent1.sinks.avroSink1-2.batch-size = 4000

agent1.sinks.avroSink1-3.type = avro
agent1.sinks.avroSink1-3.channel = ch1
agent1.sinks.avroSink1-3.hostname = hdfs-writer-machine-c.mydomain.com
agent1.sinks.avroSink1-3.port = 4545
agent1.sinks.avroSink1-3.connect-timeout = 300000
agent1.sinks.avroSink1-3.batch-size = 4000

agent1.sinks.avroSink1-4.type = avro
agent1.sinks.avroSink1-4.channel = ch1
agent1.sinks.avroSink1-4.hostname = hdfs-writer-machine-d.mydomain.com
agent1.sinks.avroSink1-4.port = 4545
agent1.sinks.avroSink1-4.connect-timeout = 300000
agent1.sinks.avroSink1-4.batch-size = 4000


# Add the sink groups; load-balance between each group of sinks, which round
# robin between different hops
agent1.sinkgroups = group1
agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4
agent1.sinkgroups.group1.processor.type = load_balance
agent1.sinkgroups.group1.processor.selector = ROUND_ROBIN
agent1.sinkgroups.group1.processor.backoff = true
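
As a rough way to reason about why more sinks should help, here is a back-of-the-envelope model in Python. It assumes each Avro sink sends one batch and waits for it to be acknowledged before taking the next one, as described in the quoted reply below. The batch size (4000), the ~200 Megabits/sec link and the 4 sinks come from this thread. The 500-byte average event size and the 0.56 s per-batch overhead (round trip plus channel commits) are made-up values, picked only so that a single sink lands near the ~25 Megabits/sec reported in the original post. The function name is illustrative and not part of Flume.

```python
def sink_throughput_mbps(batch_events, event_bytes, link_mbps, overhead_s, sinks=1):
    """Estimate throughput (Mbit/s) of `sinks` independent stop-and-wait senders.

    Each sink repeats: send one batch, wait `overhead_s` for the ack and
    channel commit, then send the next batch. The link rate caps the total.
    """
    batch_mbit = batch_events * event_bytes * 8 / 1e6
    wire_s = batch_mbit / link_mbps              # time the batch spends on the wire
    per_sink = batch_mbit / (wire_s + overhead_s)
    return min(link_mbps, sinks * per_sink)


if __name__ == "__main__":
    # Assumed values: 500-byte events and 0.56 s overhead are hypothetical.
    for n in (1, 2, 4, 8, 16):
        rate = sink_throughput_mbps(4000, 500, 200.0, 0.56, sinks=n)
        print(f"{n:2d} sink(s): ~{rate:.0f} Mbit/s")
```

With those assumed numbers, one sink comes out at about 25 Mbit/s and four sinks at about 100 Mbit/s. Beyond that the estimate is capped by the link rate. The model ignores disk contention on the File channel and sinks competing for the same link, so treat it as an upper bound on what adding sinks can buy, not a prediction.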



On Fri, Jul 26, 2013 at 1:38 PM, Pankaj Gupta <[email protected]> wrote:

> Hi Roshan,
>
> Thanks for the reply. Sorry I worded the first question wrong and confused
> sources with sinks. What I meant to ask was:
> 1. Are the batches from flume Avro Sink sent to the Avro Source on the
> next machine in a pipelined fashion or is the next batch only sent once an
> ack for previous batch is received?
>
> Overall it sounds like adding more sinks would provide more concurrency.
> I'm going to try that.
>
> About the large batch size, in our use case it won't be a big issue as
> long as we can set a timeout after which whatever events are accumulated
> are sent without requiring the batch to be full. Does such a setting exist?
>
> Thanks,
> Pankaj
>
>
>
>
> On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <[email protected]> wrote:
>
>> Could you provide a sample of the config you are using?
>>
>>
>>    1. Are the batches from flume source sent to the sink in a pipelined
>>    fashion or is the next batch only sent once an ack for previous batch is
>>    received?
>>
>> Source does not send to sink directly. Source dumps a batch of events
>> into the channel, and the sink picks it from the channel in batches and
>> writes them to destination. Sink fetches a batch from channel, writes it to
>> destination, and then fetches the next batch from channel, and the cycle
>> continues.
>>
>>
>>    2. If the batch send is not pipelined then would increasing the
>>    number of sinks draining from the channel help?
>>    The idea behind this is to basically achieve pipelining by having
>>    multiple outstanding requests and thus use network better.
>>
>> Increasing the number of sinks will increase concurrency.
>>
>>
>>    3. If batch size is very large, e.g. 1 million, would the batch only
>>    be sent once that many events have accumulated or is there a time limit
>>    after which whatever events are accumulated are sent? Is this time limit
>>    configurable? (I looked in the Avro Sink documentation for such a setting:
>>    http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>>    anything, hence asking the question)
>>
>> IMO, not a good idea to have such a large batch, especially if you would
>> like to have concurrent sinks. Each sink will need to wait for 1 million
>> events to close the transaction on the channel.
>>
>>
>>    4. Does enabling ssl have any significant impact on throughput?
>>    Increase in latency is expected but does this also affect throughput?
>>
>> Perhaps somebody else can comment on this.
>>
>> -roshan
>>
>>
>>
>> On Fri, Jul 26, 2013 at 12:34 AM, Derek Chan <[email protected]> wrote:
>>
>>>  We have a similar setup (Flume 1.3) and same problems here. Increasing
>>> the batch size did not help much but setting up multiple AvroSinks did.
>>>
>>>
>>> On 26/7/2013 9:31, Pankaj Gupta wrote:
>>>
>>> Hi,
>>>
>>>  We are trying to figure out how to get better throughput in our flume
>>> pipeline. We have flume instances on a lot of machines writing to a few
>>> collector machines running with a File Channel which in turn write to still
>>> fewer hdfs writer machines running with a File Channel and HDFS Sinks.
>>>
>>>  The problem that we're facing is that we are not getting good network
>>> usage between our flume collector machines and hdfs writer machines. The
>>> way these machines are connected is that the filechannel on collector
>>> drains to an Avro Sink which sends to Avro Source on the writer machine,
>>> which in turn writes to a filechannel draining into an HDFS Sink. So:
>>>
>>>  [FileChannel -> Avro Sink] -> [Avro Source -> FileChannel -> HDFS Sink]
>>>
>>>  I did a raw network throughput test (using netcat on the command line)
>>> between the collector and the writer and saw a throughput of
>>> ~*200 Megabits*/sec, whereas the network throughput (which I observed
>>> using iftop) between collector avro sink and writer avro source never went
>>> over *25 Megabits*/sec, even when the filechannel on the collector was
>>> quite full with millions of events queued up. We obviously want to use the
>>> network better and I am exploring ways of achieving that. The batch size we
>>> are using on avro sink on the collector is 4000.
>>>
>>>  I have a few questions regarding how AvroSource and Sink work together
>>> to help me improve the throughput and will really appreciate a response:
>>>
>>>    1. Are the batches from flume source sent to the sink in a pipelined
>>>    fashion or is the next batch only sent once an ack for previous batch is
>>>    received?
>>>    2. If the batch send is not pipelined then would increasing the
>>>    number of sinks draining from the channel help?
>>>    The idea behind this is to basically achieve pipelining by having
>>>    multiple outstanding requests and thus use network better.
>>>    3. If batch size is very large, e.g. 1 million, would the batch only
>>>    be sent once that many events have accumulated or is there a time limit
>>>    after which whatever events are accumulated are sent? Is this time limit
>>>    configurable? (I looked in the Avro Sink documentation for such a setting:
>>>    http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>>>    anything, hence asking the question)
>>>    4. Does enabling ssl have any significant impact on throughput?
>>>    Increase in latency is expected but does this also affect throughput?
>>>
>>> We are using flume 1.4.0.
>>>
>>>  Thanks in Advance,
>>> Pankaj
>>>
>>>  --
>>>
>>>
>>>  *P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 |
>>> [email protected]
>>>
>>> Pankaj Gupta | Software Engineer
>>>
>>> *BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com
>>>
>>>
>>>  United States | Canada | United Kingdom | Germany
>>>
>>>
>>>  We're 
>>> hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
>>> !
>>>
>>>
>>>
>>
>
>



