Since you mentioned an "average size of 150 bytes each" for your records, I would try increasing the batch size to a higher value.
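A rough sketch of what that could look like, reusing the names from your config below (the numbers are only an assumption to illustrate the idea, not a tuned recommendation):

# take more events from the channel per HDFS write (example value)
test.sinks.s1.hdfs.batchSize = 100000
# the channel transaction generally needs to be at least as large as the sink batch
test.channels.c1.transactionCapacity = 100000

With ~150-byte events that would be roughly 15 MB per batch instead of ~1.5 MB. As a reminder of what the parameter controls: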
"HDFS batch size determines the number of events to take from the channel and send in one go." So in 1 shot you are sending 1500000 bytes to hdfs. On Wed, Sep 3, 2014 at 1:18 PM, Sebastiano Di Paola < sebastiano.dipa...@gmail.com> wrote: > In my experiment, I just want to transfer a single file...just to test > what performances I can achieve... > so rolling file on hdfs at this point is not vital. > Anyway I did some test rolling file every 300 seconds. > What I can't explain to myself is the "slow" output from the sink...the > memory channel overflows (if it's not big enough so it seems that the souce > is able to produce a higher data rate than the data rate the sink is able > to process and send on hdfs) > I'm not sure if it can helps to pinpoint my "configuration mistake", but > I'm using Flume 1.5.0.1 (tried also Flume 1.5.0) > Regards. > Seba > > > On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana...@gmail.com> > wrote: > >> I see that you have below settings set to zero. You dont want rolling to >> hdfs to happen based upon any of the size, count or time interval? >> >> test.sinks.s1.hdfs.rollSize = 0 >> test.sinks.s1.hdfs.rollCount = 0 >> test.sinks.s1.hdfs.rollInterval = 0 >> >> >> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola < >> sebastiano.dipa...@gmail.com> wrote: >> >>> Hi Paul, >>> thank for your answer. >>> As I' m a newbie of Flume How can I attach multiple sinks to the same >>> channel? (does they read data in a round robin fashon from the memory >>> channel?) >>> (does this create multiple files on the hdfs?, because this is not what >>> I'm expecting to have I have a 500MB data file at the source and I would >>> like to have only one file on HDFS) >>> >>> I can't believe that I cannot achieve such a performance with a single >>> sink. I'm pretty sure it's a configuration issue! >>> Beside this how to tune the batchSize parameter? (Of course I have >>> already tried to set it like 10 times the number I have in my config, but >>> no relevant improvements) >>> Regards. >>> Seba >>> >>> >>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote: >>> >>>> Start adding additional HDFS sinks attached to the same channel. You >>>> can also tune batch sizes when writing to HDFS to increase per sink >>>> performance. >>>> >>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" < >>>> sebastiano.dipa...@gmail.com> wrote: >>>> >>>> Hi there, >>>> I'm a completely newbie of Flume, so I probably made a mistake in my >>>> configuration but I cannot point it out. >>>> I want to achieve transfer maximum performances. >>>> My flume machine has 16GB RAM and 8 Cores >>>> I'm using a very simple Flume architecture: >>>> Source -> Memory Channel -> Sink >>>> Source is of type netcat >>>> and Sink is hdfs >>>> The machine has 1Gb ethernet directly connected to the switch of the >>>> hadoop cluster. >>>> The point is that Flume is sooo slow in loading the data into my hdfs >>>> filesystem. >>>> (i.e. using hdfs dfs -copyFromLocal myfile */flume/events/*myfile from >>>> the same machine I will reach approx 250 Mb/s as transfer rate, while >>>> transferring the same file with this Flume architecture is like 2-3 Mb/s). 
>>>> (The cluster is composed of 10 machines and was totally idle while I did this test, so it was not under stress.) (The traffic rate was measured on the flume machine's output interface in both experiments.)
>>>> (myfile has 10 million lines with an average size of 150 bytes each.)
>>>>
>>>> From what I understand so far, it doesn't seem to be a source issue: the memory channel tends to fill up if I decrease the channel capacity (but even making it very, very big does not affect sink performance), so it seems to me that the problem is related to the sink.
>>>> To test this point I also tried changing the source to the "exec" type, simply executing "cat myfile", but the result didn't change...
>>>>
>>>> Here's the config I used...
>>>>
>>>> # list the sources, sinks and channels for the agent
>>>> test.sources = r1
>>>> test.channels = c1
>>>> test.sinks = s1
>>>>
>>>> # exec attempt
>>>> test.sources.r1.type = exec
>>>> test.sources.r1.command = cat /tmp/myfile
>>>>
>>>> # my netcat attempt
>>>> #test.sources.r1.type = netcat
>>>> #test.sources.r1.bind = localhost
>>>> #test.sources.r1.port = 6666
>>>>
>>>> # my file channel attempt
>>>> #test.channels.c1.type = file
>>>>
>>>> # my memory channel attempt
>>>> test.channels.c1.type = memory
>>>> test.channels.c1.capacity = 1000000
>>>> test.channels.c1.transactionCapacity = 10000
>>>>
>>>> # how should these parameters be set?? even if I enable them, nothing changes
>>>> # in my performance (what is the buffer percentage used for?)
>>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>>> #test.channels.c1.byteCapacity = 100000000
>>>>
>>>> # set channel for source
>>>> test.sources.r1.channels = c1
>>>> # set channel for sink
>>>> test.sinks.s1.channel = c1
>>>>
>>>> test.sinks.s1.type = hdfs
>>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>>
>>>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>>> test.sinks.s1.hdfs.filePrefix = log-data
>>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>>
>>>> # how should this parameter be set??? (I basically want to send as much data as I can)
>>>> test.sinks.s1.hdfs.batchSize = 10000
>>>>
>>>> #test.sinks.s1.hdfs.round = true
>>>> #test.sinks.s1.hdfs.roundValue = 5
>>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>>
>>>> test.sinks.s1.hdfs.rollSize = 0
>>>> test.sinks.s1.hdfs.rollCount = 0
>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>
>>>> # compression attempt
>>>> #test.sinks.s1.hdfs.fileType = CompressedStream
>>>> #test.sinks.s1.hdfs.codeC = gzip
>>>> #test.sinks.s1.hdfs.codeC = BZip2Codec
>>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>>
>>>> Can someone show me how to find this bottleneck / configuration mistake? (I can't believe this is Flume's performance on my machine.)
>>>>
>>>> Thanks a lot if you can help me.
>>>> Regards.
>>>> Sebastiano
>>>>
>>>
>>
>> --
>> Thanks and regards
>> Sandeep Khurana
>

--
Thanks and regards
Sandeep Khurana
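Regarding Paul's suggestion earlier in the thread of attaching additional HDFS sinks to the same channel, here is a minimal sketch of what that might look like on top of the config quoted above (the second sink's name, its filePrefix and the values are only illustrative assumptions). Each sink pulls events from the channel independently, so events are split between the sinks rather than duplicated, and each sink writes its own files, so the data ends up in multiple files on HDFS rather than one.

# declare a second HDFS sink reading from the same memory channel
test.sinks = s1 s2
test.sinks.s2.type = hdfs
test.sinks.s2.channel = c1
test.sinks.s2.hdfs.useLocalTimeStamp = true
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
# a distinct prefix so s1 and s2 do not contend for the same file name
test.sinks.s2.hdfs.filePrefix = log-data-s2
test.sinks.s2.hdfs.inUseSuffix = .dat
test.sinks.s2.hdfs.batchSize = 10000
test.sinks.s2.hdfs.rollSize = 0
test.sinks.s2.hdfs.rollCount = 0
test.sinks.s2.hdfs.rollInterval = 0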