Here is the Flume config of the collector machine. The File Channel is drained by 4 Avro sinks, each of which sends messages to a separate hdfs-writer machine.

agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = /flume1/checkpoint
agent1.channels.ch1.dataDirs = /flume1/data
agent1.channels.ch1.maxFileSize = 375809638400
agent1.channels.ch1.capacity = 75000000
agent1.channels.ch1.transactionCapacity = 4000

agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 4545
agent1.sources.avroSource1.threads = 16

agent1.sinks.avroSink1-1.type = avro
agent1.sinks.avroSink1-1.channel = ch1
agent1.sinks.avroSink1-1.hostname = hdfs-writer-machine-a.mydomain.com
agent1.sinks.avroSink1-1.port = 4545
agent1.sinks.avroSink1-1.connect-timeout = 300000
agent1.sinks.avroSink1-1.batch-size = 4000

agent1.sinks.avroSink1-2.type = avro
agent1.sinks.avroSink1-2.channel = ch1
agent1.sinks.avroSink1-2.hostname = hdfs-writer-machine-b.mydomain.com
agent1.sinks.avroSink1-2.port = 4545
agent1.sinks.avroSink1-2.connect-timeout = 300000
agent1.sinks.avroSink1-2.batch-size = 4000

agent1.sinks.avroSink1-3.type = avro
agent1.sinks.avroSink1-3.channel = ch1
agent1.sinks.avroSink1-3.hostname = hdfs-writer-machine-c.mydomain.com
agent1.sinks.avroSink1-3.port = 4545
agent1.sinks.avroSink1-3.connect-timeout = 300000
agent1.sinks.avroSink1-3.batch-size = 4000

agent1.sinks.avroSink1-4.type = avro
agent1.sinks.avroSink1-4.channel = ch1
agent1.sinks.avroSink1-4.hostname = hdfs-writer-machine-d.mydomain.com
agent1.sinks.avroSink1-4.port = 4545
agent1.sinks.avroSink1-4.connect-timeout = 300000
agent1.sinks.avroSink1-4.batch-size = 4000

#Add the sink groups; load-balance between each group of sinks which round robin between different hops
agent1.sinkgroups = group1
agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4
agent1.sinkgroups.group1.processor.type = load_balance
agent1.sinkgroups.group1.processor.selector = ROUND_ROBIN
agent1.sinkgroups.group1.processor.backoff = true

On Fri, Jul 26, 2013 at 1:38 PM, Pankaj Gupta
<[email protected]> wrote:

> Hi Roshan,
>
> Thanks for the reply. Sorry I worded the first question wrong and confused
> sources with sinks. What I meant to ask was:
> 1. Are the batches from flume Avro Sink sent to the Avro Source on the
> next machine in a pipelined fashion or is the next batch only sent once an
> ack for previous batch is received?
>
> Overall it sounds like adding more sinks would provide more concurrency.
> I'm going to try that.
>
> About the large batch size, in our use case it won't be a big issue as
> long as we can set a timeout after which whatever events are accumulated
> are sent without requiring the batch to be full. Does such a setting exist?
>
> Thanks,
> Pankaj
>
> On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <[email protected]> wrote:
>
>> could you provide a sample of the config you are using ?
>>
>> 1. Are the batches from flume source sent to the sink in a pipelined
>> fashion or is the next batch only sent once an ack for previous batch is
>> received?
>>
>> Source does not send to sink directly. Source dumps a batch of events
>> into the channel... and the sink picks it from the channel in batches and
>> writes them to destination. Sink fetches a batch from channel and writes to
>> destination and then fetches the next batch from channel.. and the cycle
>> continues.
>>
>> 2. If the batch send is not pipelined then would increasing the
>> number of sinks draining from the channel help?
>> The idea behind this is to basically achieve pipelining by having
>> multiple outstanding requests and thus use network better.
>>
>> Increasing the number of sinks will increase concurrency.
>>
>> 3. If batch size is very large, e.g. 1 million, would the batch only
>> be sent once that many events have accumulated or is there a time limit
>> after which whatever events are accumulated are sent? Is this time limit
>> configurable? (I looked in the Avro Sink documentation for such a setting:
>> http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>> anything, hence asking the question)
>>
>> IMO...Not a good idea to have such a large batch.. esp if you like to
>> have concurrent sinks. each sink will need to wait for 1mill events to
>> close the transactions on the channel.
>>
>> 4. Does enabling ssl have any significant impact on throughput?
>> Increase in latency is expected but does this also affect throughput?
>>
>> perhaps somebody can comment on this.
>>
>> -roshan
>>
>> On Fri, Jul 26, 2013 at 12:34 AM, Derek Chan <[email protected]> wrote:
>>
>>> We have a similar setup (Flume 1.3) and same problems here. Increasing
>>> the batch size did not help much but setting up multiple AvroSinks did.
>>>
>>> On 26/7/2013 9:31, Pankaj Gupta wrote:
>>>
>>> Hi,
>>>
>>> We are trying to figure out how to get better throughput in our flume
>>> pipeline. We have flume instances on a lot of machines writing to a few
>>> collector machines running with a File Channel which in turn write to still
>>> fewer hdfs writer machines running with a File Channel and HDFS Sinks.
>>>
>>> The problem that we're facing is that we are not getting good network
>>> usage between our flume collector machines and hdfs writer machines. The
>>> way these machines are connected is that the filechannel on collector
>>> drains to an Avro Sink which sends to Avro Source on the writer machine,
>>> which in turn writes to a filechannel draining into an HDFS Sink. So:
>>>
>>> [FileChannel -> Avro Sink] -> [Avro Source -> FileChannel -> HDFS Sink]
>>>
>>> I did a raw network throughput test (using netcat on the command line)
>>> between the collector and the writer and saw a throughput of
>>> ~*200Megabits*/sec. Whereas the network throughput (which I observed
>>> using iftop) between collector avro sink and writer avro source never went
>>> over *25Megabits*/sec, even when the filechannel on the collector was
>>> quite full with millions of events queued up. We obviously want to use the
>>> network better and I am exploring ways of achieving that. The batch size we
>>> are using on avro sink on the collector is 4000.
>>>
>>> I have a few questions regarding how AvroSource and Sink work together
>>> to help me improve the throughput and will really appreciate a response:
>>>
>>> 1. Are the batches from flume source sent to the sink in a pipelined
>>> fashion or is the next batch only sent once an ack for previous batch is
>>> received?
>>> 2. If the batch send is not pipelined then would increasing the
>>> number of sinks draining from the channel help?
>>> The idea behind this is to basically achieve pipelining by having
>>> multiple outstanding requests and thus use network better.
>>> 3. If batch size is very large, e.g. 1 million, would the batch only
>>> be sent once that many events have accumulated or is there a time limit
>>> after which whatever events are accumulated are sent? Is this time limit
>>> configurable? (I looked in the Avro Sink documentation for such a
>>> setting:
>>> http://flume.apache.org/FlumeUserGuide.html, but couldn't find
>>> anything, hence asking the question)
>>> 4. Does enabling ssl have any significant impact on throughput?
>>> Increase in latency is expected but does this also affect throughput?
>>>
>>> We are using flume 1.4.0.
>>>
>>> Thanks in Advance,
>>> Pankaj
>>>
>>> --
>>>
>>> *P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 |
>>> [email protected]
>>>
>>> Pankaj Gupta | Software Engineer
>>>
>>> *BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com
>>>
>>> United States | Canada | United Kingdom | Germany
>>>
>>> We're
>>> hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
>>> !

--
Pankaj Gupta | Software Engineer
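
A rough back-of-envelope sketch of why extra sinks help, based on the description in the thread that a sink sends one batch, waits for it to be written, and only then fetches the next one. Under that model a single sink can move at most one batch per round trip, so throughput scales roughly with the number of sinks until the link is saturated. The batch size (4000) and the ~200 Megabits/sec link figure come from the thread; the event size and round-trip time below are made-up placeholders, not measurements, and the model ignores serialization, disk, and channel overhead.

```python
import math


def sink_throughput_mbps(batch_size, event_bytes, round_trip_s):
    """Upper bound in megabits/sec for ONE sink that sends a batch and
    waits for the ack before taking the next batch (no pipelining)."""
    bits_per_batch = batch_size * event_bytes * 8
    return bits_per_batch / round_trip_s / 1e6


def sinks_needed(target_mbps, per_sink_mbps):
    """Number of concurrent sinks needed to reach target_mbps, assuming
    sinks do not slow each other down (an optimistic assumption)."""
    return math.ceil(target_mbps / per_sink_mbps)


if __name__ == "__main__":
    BATCH_SIZE = 4000      # from the config in this thread
    LINK_MBPS = 200        # raw netcat figure quoted in the thread
    EVENT_BYTES = 500      # hypothetical average event size
    ROUND_TRIP_S = 1.0     # hypothetical time per batch, including the ack

    per_sink = sink_throughput_mbps(BATCH_SIZE, EVENT_BYTES, ROUND_TRIP_S)
    print("per-sink bound (Mbit/s):", per_sink)
    print("sinks to fill the link:", sinks_needed(LINK_MBPS, per_sink))
```

Plugging in a measured average event size and per-batch latency in place of the placeholders gives a first estimate of how many sinks are worth configuring before the network, rather than the send-and-wait cycle, becomes the limit.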
