I see that you have the settings below set to zero. You don't want rolling to HDFS to happen based on size, count, or time interval?

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0
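With all three roll triggers at 0 the HDFS sink never rolls, so data keeps accumulating in the open in-use file (.dat suffix in your config) until the agent shuts down. If you do want Flume to close and roll files, something along these lines should work (just a sketch, assuming a ~128 MB target file size, adjust to taste):

# sketch only, not the poster's actual config: roll on size alone (~128 MB),
# keep count- and time-based rolling disabled
test.sinks.s1.hdfs.rollSize = 134217728
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0

On the multiple-sink question raised further down in the thread: a single channel can feed several sinks, each sink pulls its own transactions from the channel, and each sink writes its own files on HDFS, so the output does get split across more files. A rough sketch, with s2 as a hypothetical second sink name:

# hypothetical second HDFS sink draining the same memory channel c1;
# a different filePrefix keeps its output apart from s1's files
test.sinks = s1 s2
test.sinks.s2.channel = c1
test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-2
test.sinks.s2.hdfs.batchSize = 10000

Per-sink throughput is limited by the sink's synchronous HDFS writes, which is why adding sinks (and tuning hdfs.batchSize) is usually what helps, at the cost of the data no longer ending up in one file.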
On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <sebastiano.dipa...@gmail.com> wrote:

> Hi Paul,
> thanks for your answer.
> As I'm a newbie to Flume, how can I attach multiple sinks to the same channel? (Do they read data in a round-robin fashion from the memory channel?)
> (Does this create multiple files on HDFS? Because that is not what I'm expecting: I have a 500MB data file at the source and I would like to have only one file on HDFS.)
>
> I can't believe that I cannot achieve such performance with a single sink. I'm pretty sure it's a configuration issue!
> Besides this, how do I tune the batchSize parameter? (Of course I have already tried setting it to about 10 times the number I have in my config, but no relevant improvement.)
> Regards.
> Seba
>
>
> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:
>
>> Start adding additional HDFS sinks attached to the same channel. You can also tune batch sizes when writing to HDFS to increase per-sink performance.
>>
>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <sebastiano.dipa...@gmail.com> wrote:
>>
>> Hi there,
>> I'm a complete newbie to Flume, so I probably made a mistake in my configuration but I cannot point it out.
>> I want to achieve maximum transfer performance.
>> My Flume machine has 16GB RAM and 8 cores.
>> I'm using a very simple Flume architecture:
>> Source -> Memory Channel -> Sink
>> Source is of type netcat and Sink is hdfs.
>> The machine has a 1Gb Ethernet link directly connected to the switch of the Hadoop cluster.
>> The point is that Flume is very slow in loading the data into my HDFS filesystem.
>> (i.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from the same machine I reach approx. 250 Mb/s transfer rate, while transferring the same file with this Flume architecture is more like 2-3 Mb/s.)
>> (The cluster is composed of 10 machines and was totally idle while I did this test, so it was not under stress.) (The traffic rate was measured on the Flume machine's output interface in both experiments.)
>> (myfile has 10 million lines with an average size of 150 bytes each.)
>>
>> From what I understand so far, it doesn't seem to be a source issue, as the memory channel tends to fill up if I decrease the channel capacity (but even making it very big does not affect sink performance), so it seems to me that the problem is related to the sink.
>> In order to test this point I've also tried changing the source to the "exec" type and simply executing "cat myfile", but the result hasn't changed...
>>
>>
>> Here's the config I used...
>>
>> # list the sources, sinks and channels for the agent
>> test.sources = r1
>> test.channels = c1
>> test.sinks = s1
>>
>> # exec attempt
>> test.sources.r1.type = exec
>> test.sources.r1.command = cat /tmp/myfile
>>
>> # my netcat attempt
>> #test.sources.r1.type = netcat
>> #test.sources.r1.bind = localhost
>> #test.sources.r1.port = 6666
>>
>> # my file channel attempt
>> #test.channels.c1.type = file
>>
>> # my memory channel attempt
>> test.channels.c1.type = memory
>> test.channels.c1.capacity = 1000000
>> test.channels.c1.transactionCapacity = 10000
>>
>> # how do I properly set these parameters?? even if I enable them nothing changes
>> # in my performance (what is the buffer percentage used for?)
>> #test.channels.c1.byteCapacityBufferPercentage = 50
>> #test.channels.c1.byteCapacity = 100000000
>>
>> # set channel for source
>> test.sources.r1.channels = c1
>> # set channel for sink
>> test.sinks.s1.channel = c1
>>
>> test.sinks.s1.type = hdfs
>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>
>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>> test.sinks.s1.hdfs.filePrefix = log-data
>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>
>> # how do I set this parameter??? (I basically want to send as much data as I can)
>> test.sinks.s1.hdfs.batchSize = 10000
>>
>> #test.sinks.s1.hdfs.round = true
>> #test.sinks.s1.hdfs.roundValue = 5
>> #test.sinks.s1.hdfs.roundUnit = minute
>>
>> test.sinks.s1.hdfs.rollSize = 0
>> test.sinks.s1.hdfs.rollCount = 0
>> test.sinks.s1.hdfs.rollInterval = 0
>>
>> # compression attempt
>> #test.sinks.s1.hdfs.fileType = CompressedStream
>> #test.sinks.s1.hdfs.codeC = gzip
>> #test.sinks.s1.hdfs.codeC = BZip2Codec
>> #test.sinks.s1.hdfs.callTimeout = 120000
>>
>> Can someone show me how to find this bottleneck / configuration mistake?
>> (I can't believe these are Flume's performance numbers on my machine.)
>>
>> Thanks a lot if you can help me.
>> Regards.
>> Sebastiano
>>
>>
>

--
Thanks and regards
Sandeep Khurana