In my experiment I just want to transfer a single file, simply to test what performance I can achieve, so rolling files on HDFS is not vital at this point. Anyway, I did some tests rolling the file every 300 seconds.

What I can't explain to myself is the "slow" output from the sink: the memory channel overflows if it's not big enough, so it seems that the source is able to produce a higher data rate than the sink is able to process and send to HDFS.

I'm not sure if it helps to pinpoint my "configuration mistake", but I'm using Flume 1.5.0.1 (I also tried Flume 1.5.0).

Regards,
Seba
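P.S. For reference, a 300-second time-based roll corresponds to sink settings along these lines (a minimal sketch reusing the s1 sink from the config quoted below; rollInterval becomes the only roll trigger):

    # roll the in-progress HDFS file every 300 seconds, and on nothing else
    test.sinks.s1.hdfs.rollInterval = 300
    test.sinks.s1.hdfs.rollSize = 0
    test.sinks.s1.hdfs.rollCount = 0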
On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:

> I see that you have the settings below set to zero. You don't want rolling
> to HDFS to happen based upon any of size, count or time interval?
>
> test.sinks.s1.hdfs.rollSize = 0
> test.sinks.s1.hdfs.rollCount = 0
> test.sinks.s1.hdfs.rollInterval = 0
>
>
> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <
> sebastiano.dipa...@gmail.com> wrote:
>
>> Hi Paul,
>> thanks for your answer.
>> As I'm a Flume newbie: how can I attach multiple sinks to the same
>> channel? (Do they read data from the memory channel in a round-robin
>> fashion?)
>> (Does this create multiple files on HDFS? That is not what I'm expecting
>> to get: I have a 500 MB data file at the source and I would like to have
>> only one file on HDFS.)
>>
>> I can't believe that I cannot achieve such performance with a single
>> sink. I'm pretty sure it's a configuration issue!
>> Besides this, how should I tune the batchSize parameter? (Of course I
>> have already tried setting it to about 10 times the number I have in my
>> config, but with no relevant improvement.)
>> Regards.
>> Seba
>>
>>
>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:
>>
>>> Start adding additional HDFS sinks attached to the same channel. You
>>> can also tune batch sizes when writing to HDFS to increase per-sink
>>> performance.
>>>
>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
>>> sebastiano.dipa...@gmail.com> wrote:
>>>
>>> Hi there,
>>> I'm a complete newbie with Flume, so I probably made a mistake in my
>>> configuration, but I cannot pin it down.
>>> I want to achieve maximum transfer performance.
>>> My Flume machine has 16 GB RAM and 8 cores.
>>> I'm using a very simple Flume architecture:
>>> source -> memory channel -> sink
>>> The source is of type netcat and the sink is hdfs.
>>> The machine has a 1 Gb Ethernet link directly connected to the switch
>>> of the Hadoop cluster.
>>> The point is that Flume is very slow in loading the data into my HDFS
>>> filesystem.
>>> (Using "hdfs dfs -copyFromLocal myfile /flume/events/myfile" from the
>>> same machine I reach approx. 250 Mb/s, while transferring the same file
>>> through this Flume setup runs at 2-3 Mb/s. The cluster is composed of
>>> 10 machines and was totally idle while I ran this test, so it was not
>>> under stress. The traffic rate was measured on the Flume machine's
>>> output interface in both experiments. myfile has 10 million lines with
>>> an average size of 150 bytes each.)
>>>
>>> From what I have understood so far, it doesn't seem to be a source
>>> issue: the memory channel tends to fill up if I decrease the channel
>>> capacity (and even making it very big does not affect sink
>>> performance), so it seems to me that the problem is related to the
>>> sink.
>>> To test this point I also tried changing the source to the "exec" type,
>>> simply executing "cat myfile", but the result didn't change...
>>>
>>>
>>> Here's the config I used...
>>>
>>> # list the sources, sinks and channels for the agent
>>> test.sources = r1
>>> test.channels = c1
>>> test.sinks = s1
>>>
>>> # exec attempt
>>> test.sources.r1.type = exec
>>> test.sources.r1.command = cat /tmp/myfile
>>>
>>> # my netcat attempt
>>> #test.sources.r1.type = netcat
>>> #test.sources.r1.bind = localhost
>>> #test.sources.r1.port = 6666
>>>
>>> # my file channel attempt
>>> #test.channels.c1.type = file
>>>
>>> # my memory channel attempt
>>> test.channels.c1.type = memory
>>> test.channels.c1.capacity = 1000000
>>> test.channels.c1.transactionCapacity = 10000
>>>
>>> # how to properly set these parameters?? even if I enable them nothing
>>> # changes in my performance (what is the buffer percentage used for?)
>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>> #test.channels.c1.byteCapacity = 100000000
>>>
>>> # set channel for source
>>> test.sources.r1.channels = c1
>>> # set channel for sink
>>> test.sinks.s1.channel = c1
>>>
>>> test.sinks.s1.type = hdfs
>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>
>>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>> test.sinks.s1.hdfs.filePrefix = log-data
>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>
>>> # how to set this parameter??? (I basically want to send as much data
>>> # as I can)
>>> test.sinks.s1.hdfs.batchSize = 10000
>>>
>>> #test.sinks.s1.hdfs.round = true
>>> #test.sinks.s1.hdfs.roundValue = 5
>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>
>>> test.sinks.s1.hdfs.rollSize = 0
>>> test.sinks.s1.hdfs.rollCount = 0
>>> test.sinks.s1.hdfs.rollInterval = 0
>>>
>>> # compression attempt
>>> #test.sinks.s1.hdfs.fileType = CompressedStream
>>> #test.sinks.s1.hdfs.codeC = gzip
>>> #test.sinks.s1.hdfs.codeC = BZip2Codec
>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>
>>> Can someone show me how to find this bottleneck / configuration
>>> mistake? (I can't believe that this is Flume's performance on my
>>> machine.)
>>>
>>> Thanks a lot if you can help me.
>>> Regards.
>>> Sebastiano
>>>
>>
>
> --
> Thanks and regards
> Sandeep Khurana
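As a sketch of Paul's suggestion, here is what several HDFS sinks draining the same channel could look like (a minimal sketch based on the config above). Each sink takes batches from the channel independently, so they compete for events rather than strictly round-robin, and each sink writes its own file, which means this does create multiple files on HDFS; the distinct filePrefix values here are only illustrative, to keep the files apart:

    # two HDFS sinks attached to the same memory channel
    test.sinks = s1 s2
    test.sinks.s1.channel = c1
    test.sinks.s2.channel = c1

    test.sinks.s1.type = hdfs
    test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
    test.sinks.s1.hdfs.filePrefix = log-data-1

    test.sinks.s2.type = hdfs
    test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
    test.sinks.s2.hdfs.filePrefix = log-data-2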
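On the batchSize question: each HDFS sink flush takes up to hdfs.batchSize events from the channel in a single transaction, so (as I understand it) hdfs.batchSize should not exceed the channel's transactionCapacity, and raising one without the other does little. A sketch raising the two together, with values chosen only for illustration:

    # each sink transaction can now move up to 100k events at a time;
    # transactionCapacity must stay <= the channel's capacity (1000000 above)
    test.channels.c1.transactionCapacity = 100000
    test.sinks.s1.hdfs.batchSize = 100000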