Hi Paul, thanks for your answer. As I'm a newbie to Flume: how can I attach multiple sinks to the same channel? (Do they read data from the memory channel in a round-robin fashion? And does this create multiple files on HDFS? That is not what I'm expecting to get: I have a 500 MB data file at the source and I would like to end up with only one file on HDFS.)
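Just to make sure I understand the suggestion, would the fan-out look roughly like the snippet below? (I'm guessing at the syntax and at the second sink name s2 from the user guide, so please correct me if I got it wrong.)

test.sinks = s1 s2
# both sinks drain the same memory channel
test.sinks.s1.channel = c1
test.sinks.s2.channel = c1
test.sinks.s1.type = hdfs
test.sinks.s2.type = hdfs
# separate prefixes so the two sinks do not fight over the same in-use file
test.sinks.s1.hdfs.filePrefix = log-data-s1
test.sinks.s2.hdfs.filePrefix = log-data-s2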
I can't believe that I cannot achieve such performance with a single sink. I'm pretty sure it's a configuration issue! Besides this, how should I tune the batchSize parameter? (Of course I have already tried setting it to roughly 10 times the value in my config, but with no relevant improvement; a sketch of what I tried is at the bottom of this mail.)
Regards,
Seba

On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:

> Start adding additional HDFS sinks attached to the same channel. You can
> also tune batch sizes when writing to HDFS to increase per-sink performance.
>
> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <sebastiano.dipa...@gmail.com> wrote:
>
> Hi there,
> I'm a complete newbie to Flume, so I have probably made a mistake in my
> configuration, but I cannot pinpoint it.
> I want to achieve maximum transfer performance.
> My Flume machine has 16 GB of RAM and 8 cores.
> I'm using a very simple Flume architecture:
> Source -> Memory Channel -> Sink
> The source is of type netcat and the sink is hdfs.
> The machine has a 1 Gb Ethernet interface directly connected to the switch
> of the Hadoop cluster.
> The point is that Flume is very slow at loading the data into my HDFS
> filesystem.
> (i.e. using "hdfs dfs -copyFromLocal myfile /flume/events/myfile" from the
> same machine I reach approx. 250 Mb/s as the transfer rate, while
> transferring the same file with this Flume architecture runs at more like
> 2-3 Mb/s.)
> (The cluster is composed of 10 machines and was totally idle while I ran
> this test, so it was not under stress.) (The traffic rate was measured on
> the Flume machine's output interface in both experiments.)
> (myfile has 10 million lines with an average size of 150 bytes each.)
>
> From what I understand so far, it doesn't seem to be a source issue, as
> the memory channel tends to fill up if I decrease the channel capacity
> (but even making it very, very big does not affect sink performance), so
> it seems to me that the problem is related to the sink.
> To test this point I've also tried changing the source to the "exec" type
> and simply executing "cat myfile", but the result didn't change.
>
> Here's the config I used:
>
> # list the sources, sinks and channels for the agent
> test.sources = r1
> test.channels = c1
> test.sinks = s1
>
> # exec attempt
> test.sources.r1.type = exec
> test.sources.r1.command = cat /tmp/myfile
>
> # my netcat attempt
> #test.sources.r1.type = netcat
> #test.sources.r1.bind = localhost
> #test.sources.r1.port = 6666
>
> # my file channel attempt
> #test.channels.c1.type = file
>
> # my memory channel attempt
> test.channels.c1.type = memory
> test.channels.c1.capacity = 1000000
> test.channels.c1.transactionCapacity = 10000
>
> # how do I properly set these parameters? Even if I enable them, nothing
> # changes in my performance (what is the buffer percentage used for?)
> #test.channels.c1.byteCapacityBufferPercentage = 50
> #test.channels.c1.byteCapacity = 100000000
>
> # set channel for source
> test.sources.r1.channels = c1
> # set channel for sink
> test.sinks.s1.channel = c1
>
> test.sinks.s1.type = hdfs
> test.sinks.s1.hdfs.useLocalTimeStamp = true
>
> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
> test.sinks.s1.hdfs.filePrefix = log-data
> test.sinks.s1.hdfs.inUseSuffix = .dat
>
> # how do I set this parameter? (I basically want to send as much data as
> # I can)
> test.sinks.s1.hdfs.batchSize = 10000
>
> #test.sinks.s1.hdfs.round = true
> #test.sinks.s1.hdfs.roundValue = 5
> #test.sinks.s1.hdfs.roundUnit = minute
>
> test.sinks.s1.hdfs.rollSize = 0
> test.sinks.s1.hdfs.rollCount = 0
> test.sinks.s1.hdfs.rollInterval = 0
>
> # compression attempt
> #test.sinks.s1.hdfs.fileType = CompressedStream
> #test.sinks.s1.hdfs.codeC = gzip
> #test.sinks.s1.hdfs.codeC = BZip2Codec
> #test.sinks.s1.hdfs.callTimeout = 120000
>
> Can someone show me how to find this bottleneck / configuration mistake?
> (I can't believe that this is Flume's performance on my machine.)
>
> Thanks a lot if you can help me.
> Regards,
> Sebastiano
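P.S. In case it helps, this is roughly the batchSize change I already tried (about 10x the value in the config above). I'm assuming the channel's transactionCapacity has to be at least as large as the sink's batchSize, so I raised that as well; even so, I saw no relevant improvement:

# roughly what I tried (values are approximate)
test.channels.c1.transactionCapacity = 100000
test.sinks.s1.hdfs.batchSize = 100000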