Hi Roshan,

Thanks for the reply. Sorry, I worded the first question badly and confused
sources with sinks. What I meant to ask was:
1. Are the batches from the Flume Avro Sink sent to the Avro Source on the next
machine in a pipelined fashion, or is the next batch only sent once an ack
for the previous batch is received?

Overall it sounds like adding more sinks would provide more concurrency.
I'm going to try that.
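
To make the plan concrete, here is a rough sketch of the collector config I have in mind (agent name, hostnames, and ports below are placeholders), with two Avro Sinks draining the same file channel so there can be more than one batch in flight:

```
# Sketch only: two Avro Sinks draining one file channel for concurrency.
# Agent name, hostnames, and ports are placeholders.
collector.channels = fc1
collector.sinks = avroSink1 avroSink2

collector.channels.fc1.type = file

collector.sinks.avroSink1.type = avro
collector.sinks.avroSink1.channel = fc1
collector.sinks.avroSink1.hostname = writer01.example.com
collector.sinks.avroSink1.port = 4545
collector.sinks.avroSink1.batch-size = 4000

collector.sinks.avroSink2.type = avro
collector.sinks.avroSink2.channel = fc1
collector.sinks.avroSink2.hostname = writer01.example.com
collector.sinks.avroSink2.port = 4545
collector.sinks.avroSink2.batch-size = 4000
```

Does that look roughly right?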

About the large batch size: in our use case it won't be a big issue as long
as we can set a timeout after which whatever events have accumulated are
sent without requiring the batch to be full. Does such a setting exist?

Thanks,
Pankaj




On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <[email protected]> wrote:

> could you provide a sample of the config you are using ?
>
>
>    1. Are the batches from the Flume source sent to the sink in a pipelined
>    fashion, or is the next batch only sent once an ack for the previous batch
>    is received?
>
> The source does not send to the sink directly. The source dumps a batch of
> events into the channel, and the sink picks them up from the channel in
> batches and writes them to the destination. The sink fetches a batch from
> the channel, writes it to the destination, then fetches the next batch, and
> the cycle continues.
>
>
>    1. If the batch send is not pipelined, would increasing the number
>    of sinks draining from the channel help?
>    The idea is to achieve pipelining by having
>    multiple outstanding requests and thus use the network better.
>
> Increasing the number of sinks will increase concurrency.
>
>
>    1. If the batch size is very large, e.g. 1 million, would the batch only
>    be sent once that many events have accumulated, or is there a time limit
>    after which whatever events have accumulated are sent? Is this time limit
>    configurable? (I looked in the Avro Sink documentation for such a setting:
>    http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>    anything, hence the question.)
>
> IMO it's not a good idea to have such a large batch, especially if you want
> concurrent sinks: each sink will need to wait for 1 million events before it
> can close its transaction on the channel.
>
>
>    1. Does enabling SSL have any significant impact on throughput?
>    An increase in latency is expected, but does it also affect throughput?
>
> Perhaps somebody else can comment on this.
>
> -roshan
>
>
>
> On Fri, Jul 26, 2013 at 12:34 AM, Derek Chan <[email protected]> wrote:
>
>>  We have a similar setup (Flume 1.3) and same problems here. Increasing
>> the batch size did not help much but setting up multiple AvroSinks did.
>>
>>
>> On 26/7/2013 9:31, Pankaj Gupta wrote:
>>
>> Hi,
>>
>>  We are trying to figure out how to get better throughput in our Flume
>> pipeline. We have Flume instances on a lot of machines writing to a few
>> collector machines running a File Channel, which in turn write to still
>> fewer HDFS writer machines running a File Channel and HDFS Sinks.
>>
>>  The problem we're facing is that we are not getting good network
>> usage between our Flume collector machines and HDFS writer machines. These
>> machines are connected as follows: the file channel on the collector
>> drains to an Avro Sink, which sends to an Avro Source on the writer machine,
>> which in turn writes to a file channel draining into an HDFS Sink. So:
>>
>>  [FileChannel -> Avro Sink] -> [Avro Source -> FileChannel -> HDFS Sink]
>>
>>  I did a raw network throughput test (using netcat on the command line)
>> between the collector and the writer and saw a throughput of ~*200
>> Megabits*/sec, whereas the network throughput (observed using iftop)
>> between the collector Avro sink and the writer Avro source never went
>> over *25 Megabits*/sec, even when the file channel on the collector was
>> quite full, with millions of events queued up. We obviously want to use the
>> network better, and I am exploring ways of achieving that. The batch size we
>> are using on the Avro sink on the collector is 4000.
>>
>>  I have a few questions regarding how AvroSource and Sink work together
>> to help me improve the throughput and will really appreciate a response:
>>
>>    1. Are the batches from the Flume source sent to the sink in a pipelined
>>    fashion, or is the next batch only sent once an ack for the previous
>>    batch is received?
>>    2. If the batch send is not pipelined, would increasing the
>>    number of sinks draining from the channel help?
>>    The idea is to achieve pipelining by having
>>    multiple outstanding requests and thus use the network better.
>>    3. If the batch size is very large, e.g. 1 million, would the batch only
>>    be sent once that many events have accumulated, or is there a time limit
>>    after which whatever events have accumulated are sent? Is this time limit
>>    configurable? (I looked in the Avro Sink documentation for such a setting:
>>    http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>>    anything, hence the question.)
>>    4. Does enabling SSL have any significant impact on throughput?
>>    An increase in latency is expected, but does it also affect throughput?
>>
>> We are using Flume 1.4.0.
>>
>>  Thanks in Advance,
>> Pankaj
>>
>>
>


-- 


*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | [email protected]

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>!
