I raised batchSize by a factor of 100, added more heap space, and speed increased... still not the same speed as using "hdfs dfs -copyFromLocal", but I'm pretty sure it's a tuning problem. Thanks a lot for your hint.
Regards,
Seba
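For readers following along, the two changes described above can be sketched as config fragments (the exact values below are illustrative, not the ones Seba used; heap is raised via JAVA_OPTS in conf/flume-env.sh, e.g. -Xmx4g):

```properties
# hdfs sink batch size raised by a factor of 100
# (10,000 -> 1,000,000 events per flush; at ~150 bytes/event
# that is ~150 MB handed to hdfs per batch)
test.sinks.s1.hdfs.batchSize = 1000000

# the channel's transactionCapacity must be at least the sink's
# batchSize, or the sink's take transaction will fail
test.channels.c1.transactionCapacity = 1000000
```
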
On Wed, Sep 3, 2014 at 9:55 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:
> Since you mentioned "average size of 150 bytes each" for your records,
> I would try increasing the batch size to a higher value.
>
> "HDFS batch size determines the number of events to take from the channel and
> send in one go."
>
> So in one shot you are sending 1500000 bytes to hdfs.
>
> On Wed, Sep 3, 2014 at 1:18 PM, Sebastiano Di Paola <
> sebastiano.dipa...@gmail.com> wrote:
>
>> In my experiment I just want to transfer a single file... just to test
>> what performance I can achieve, so rolling the file on hdfs is not
>> vital at this point.
>> Anyway, I did some tests rolling the file every 300 seconds.
>> What I can't explain to myself is the "slow" output from the sink: the
>> memory channel overflows if it's not big enough, so it seems the source
>> can produce a higher data rate than the sink can process and send to hdfs.
>> I'm not sure if it helps to pinpoint my "configuration mistake", but
>> I'm using Flume 1.5.0.1 (tried also Flume 1.5.0).
>> Regards.
>> Seba
>>
>> On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana...@gmail.com>
>> wrote:
>>
>>> I see that you have the settings below set to zero. You don't want
>>> rolling to hdfs to happen based on any of size, count or time interval?
>>>
>>> test.sinks.s1.hdfs.rollSize = 0
>>> test.sinks.s1.hdfs.rollCount = 0
>>> test.sinks.s1.hdfs.rollInterval = 0
>>>
>>> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <
>>> sebastiano.dipa...@gmail.com> wrote:
>>>
>>>> Hi Paul,
>>>> thanks for your answer.
>>>> As I'm a Flume newbie, how can I attach multiple sinks to the same
>>>> channel? (Do they read data from the memory channel in a round-robin
>>>> fashion?)
>>>> (Does this create multiple files on hdfs? That is not what I'm
>>>> expecting: I have a 500MB data file at the source and I would like
>>>> to have only one file on HDFS.)
>>>>
>>>> I can't believe that I cannot achieve such performance with a single
>>>> sink. I'm pretty sure it's a configuration issue!
>>>> Besides this, how do I tune the batchSize parameter? (Of course I have
>>>> already tried setting it to 10 times the number I have in my config,
>>>> but with no relevant improvement.)
>>>> Regards.
>>>> Seba
>>>>
>>>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:
>>>>
>>>>> Start adding additional HDFS sinks attached to the same channel. You
>>>>> can also tune batch sizes when writing to HDFS to increase per-sink
>>>>> performance.
>>>>>
>>>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
>>>>> sebastiano.dipa...@gmail.com> wrote:
>>>>>
>>>>> Hi there,
>>>>> I'm a complete newbie to Flume, so I have probably made a mistake in
>>>>> my configuration, but I cannot pin it down.
>>>>> I want to achieve maximum transfer performance.
>>>>> My flume machine has 16GB RAM and 8 cores.
>>>>> I'm using a very simple Flume architecture:
>>>>> Source -> Memory Channel -> Sink
>>>>> Source is of type netcat and Sink is hdfs.
>>>>> The machine has 1Gb ethernet directly connected to the switch of the
>>>>> hadoop cluster.
>>>>> The point is that Flume is very slow in loading the data into my hdfs
>>>>> filesystem.
>>>>> (i.e. using "hdfs dfs -copyFromLocal myfile /flume/events/myfile"
>>>>> from the same machine I reach approx. 250 Mb/s as transfer rate,
>>>>> while transferring the same file with this Flume architecture is
>>>>> more like 2-3 Mb/s.)
>>>>> (The cluster is composed of 10 machines and was totally idle while I
>>>>> did this test, so it was not under stress. The traffic rate was
>>>>> measured on the flume machine's output interface in both experiments.)
>>>>> (myfile has 10 million lines of average size 150 bytes each.)
>>>>>
>>>>> From what I understand so far it doesn't seem to be a source issue,
>>>>> as the memory channel tends to fill up if I decrease the channel
>>>>> capacity (and even making it very big does not affect sink
>>>>> performance), so it seems to me that the problem is related to the
>>>>> sink.
>>>>> In order to test this point I also tried changing the source to the
>>>>> "exec" type, simply executing "cat myfile", but the result didn't
>>>>> change...
>>>>>
>>>>> Here's the config I used...
>>>>>
>>>>> # list the sources, sinks and channels for the agent
>>>>> test.sources = r1
>>>>> test.channels = c1
>>>>> test.sinks = s1
>>>>>
>>>>> # exec attempt
>>>>> test.sources.r1.type = exec
>>>>> test.sources.r1.command = cat /tmp/myfile
>>>>>
>>>>> # my netcat attempt
>>>>> #test.sources.r1.type = netcat
>>>>> #test.sources.r1.bind = localhost
>>>>> #test.sources.r1.port = 6666
>>>>>
>>>>> # my file channel attempt
>>>>> #test.channels.c1.type = file
>>>>>
>>>>> # my memory channel attempt
>>>>> test.channels.c1.type = memory
>>>>> test.channels.c1.capacity = 1000000
>>>>> test.channels.c1.transactionCapacity = 10000
>>>>>
>>>>> # how do I properly set these parameters? even when I enable these,
>>>>> # nothing changes in my performance (what is the buffer percentage
>>>>> # used for?)
>>>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>>>> #test.channels.c1.byteCapacity = 100000000
>>>>>
>>>>> # set channel for source
>>>>> test.sources.r1.channels = c1
>>>>> # set channel for sink
>>>>> test.sinks.s1.channel = c1
>>>>>
>>>>> test.sinks.s1.type = hdfs
>>>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>>>
>>>>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>>>> test.sinks.s1.hdfs.filePrefix = log-data
>>>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>>>
>>>>> # how do I set this parameter? (I basically want to send as much
>>>>> # data as I can)
>>>>> test.sinks.s1.hdfs.batchSize = 10000
>>>>>
>>>>> #test.sinks.s1.hdfs.round = true
>>>>> #test.sinks.s1.hdfs.roundValue = 5
>>>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>>>
>>>>> test.sinks.s1.hdfs.rollSize = 0
>>>>> test.sinks.s1.hdfs.rollCount = 0
>>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>>
>>>>> # compression attempt (note the "hdsf" typo on the first line, which
>>>>> # would make that property a silent no-op)
>>>>> #test.sinks.s1.hdsf.fileType = CompressedStream
>>>>> #test.sinks.s1.hdfs.codeC = gzip
>>>>> #test.sinks.s1.hdfs.codeC = BZip2Codec
>>>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>>>
>>>>> Can someone show me how to find this bottleneck / configuration
>>>>> mistake? (I can't believe this is Flume's performance on my machine.)
>>>>>
>>>>> Thanks a lot if you can help me.
>>>>> Regards.
>>>>> Sebastiano
>>>
>>> --
>>> Thanks and regards
>>> Sandeep Khurana
>
> --
> Thanks and regards
> Sandeep Khurana
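For reference, the multi-sink setup Paul suggests could look like the fragment below (sink names and per-sink prefixes are illustrative, extending the config quoted above). Sinks attached to one channel each run in their own thread and compete for events rather than taking strict round-robin turns, and each sink writes its own files, so this trades the single-output-file requirement for throughput; distinct filePrefix values keep the sinks from colliding on file names:

```properties
# fan out to three hdfs sinks draining the same memory channel
test.sinks = s1 s2 s3

test.sinks.s1.channel = c1
test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-s1
test.sinks.s1.hdfs.batchSize = 10000

test.sinks.s2.channel = c1
test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-s2
test.sinks.s2.hdfs.batchSize = 10000

test.sinks.s3.channel = c1
test.sinks.s3.type = hdfs
test.sinks.s3.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s3.hdfs.filePrefix = log-data-s3
test.sinks.s3.hdfs.batchSize = 10000
```

If one file per interval is ever required again, the split outputs can be concatenated on the cluster afterwards, or a sink group with a load-balancing sink processor can be configured instead; either way the per-sink files would still need a merge step.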