Start by adding additional HDFS sinks attached to the same channel; each sink runs on its own thread, so several sinks can drain the channel in parallel. You can also tune the batch size when writing to HDFS to increase per-sink performance. A sketch of what that could look like is below.
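For example, a minimal sketch of a two-sink layout based on your config (the sink names s1/s2, the distinct filePrefix values, and the batchSize value are illustrative, not required names):

# two HDFS sinks draining the same channel in parallel
test.sinks = s1 s2

test.sinks.s1.channel = c1
test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
# give each sink its own prefix so they don't write to the same file
test.sinks.s1.hdfs.filePrefix = log-data-1
# events taken from the channel per transaction and written per HDFS flush
test.sinks.s1.hdfs.batchSize = 10000

test.sinks.s2.channel = c1
test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-2
test.sinks.s2.hdfs.batchSize = 10000

Keep each sink's hdfs.batchSize at or below the channel's transactionCapacity, since a batch is taken from the channel in a single transaction.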
On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <sebastiano.dipa...@gmail.com> wrote:

Hi there,
I'm a complete newbie with Flume, so I probably made a mistake in my configuration, but I cannot pin it down. I want to achieve maximum transfer performance. My Flume machine has 16GB RAM and 8 cores, and I'm using a very simple Flume architecture: Source -> Memory Channel -> Sink. The source is of type netcat and the sink is hdfs. The machine has a 1Gb Ethernet interface directly connected to the switch of the Hadoop cluster.

The point is that Flume is very slow at loading the data into my HDFS filesystem: using "hdfs dfs -copyFromLocal myfile /flume/events/myfile" from the same machine I reach approx. 250 Mb/s, while transferring the same file through this Flume architecture runs at 2-3 Mb/s. (The cluster consists of 10 machines and was totally idle while I ran this test, so it was not under stress. The traffic rate was measured on the Flume machine's output interface in both experiments. myfile has 10 million lines with an average size of 150 bytes each.)

From what I understand so far, it doesn't seem to be a source issue, as the memory channel tends to fill up if I decrease the channel capacity (and even making it very, very big does not affect sink performance), so it seems to me that the problem is related to the sink. To test this I also tried changing the source to type "exec", simply executing "cat myfile", but the result didn't change.

Here's the config I used:

# list the sources, sinks and channels for the agent
test.sources = r1
test.channels = c1
test.sinks = s1

# exec attempt
test.sources.r1.type = exec
test.sources.r1.command = cat /tmp/myfile

# my netcat attempt
#test.sources.r1.type = netcat
#test.sources.r1.bind = localhost
#test.sources.r1.port = 6666

# my file channel attempt
#test.channels.c1.type = file

# my memory channel attempt
test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000

# how do I properly set these parameters? Even if I enable them, nothing changes
# in my performance (what is the buffer percentage used for?)
#test.channels.c1.byteCapacityBufferPercentage = 50
#test.channels.c1.byteCapacity = 100000000

# set channel for source
test.sources.r1.channels = c1

# set channel for sink
test.sinks.s1.channel = c1
test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.useLocalTimeStamp = true
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data
test.sinks.s1.hdfs.inUseSuffix = .dat

# how should I set this parameter? (I basically want to send as much data as I can)
test.sinks.s1.hdfs.batchSize = 10000

#test.sinks.s1.hdfs.round = true
#test.sinks.s1.hdfs.roundValue = 5
#test.sinks.s1.hdfs.roundUnit = minute

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0

# compression attempt
#test.sinks.s1.hdfs.fileType = CompressedStream
#test.sinks.s1.hdfs.codeC = gzip
#test.sinks.s1.hdfs.codeC = BZip2Codec

#test.sinks.s1.hdfs.callTimeout = 120000

Can someone show me how to find this bottleneck / configuration mistake? (I can't believe this is Flume's performance on my machine.)
Thanks a lot if you can help me.
Regards,
Sebastiano