Which was your final configuration? What speed did you get?

2014-09-03 11:18 GMT+02:00 Sebastiano Di Paola <sebastiano.dipa...@gmail.com>:
> I raised batchSize by a factor of 100 and added more heap space, and speed
> increased... still not the same speed as using "hdfs dfs -copyFromLocal",
> but I'm pretty sure it's a tuning problem.
> Thanks a lot for your hint.
> Regards
> Seba
>
> On Wed, Sep 3, 2014 at 9:55 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:
>
>> Since you mentioned each record has an "average size of 150 bytes", I
>> would try increasing the batch size to a higher value.
>>
>> "HDFS batch size determines the number of events to take from the
>> channel and send in one go."
>>
>> So in one shot you are sending 1,500,000 bytes to HDFS.
>>
>> On Wed, Sep 3, 2014 at 1:18 PM, Sebastiano Di Paola <
>> sebastiano.dipa...@gmail.com> wrote:
>>
>>> In my experiment I just want to transfer a single file... just to test
>>> what performance I can achieve, so rolling the file on HDFS is not
>>> vital at this point.
>>> Anyway, I did some tests rolling the file every 300 seconds.
>>> What I can't explain to myself is the "slow" output from the sink: the
>>> memory channel overflows if it's not big enough, so it seems the source
>>> is able to produce a higher data rate than the sink is able to process
>>> and send to HDFS.
>>> I'm not sure if it helps to pinpoint my "configuration mistake", but
>>> I'm using Flume 1.5.0.1 (I also tried Flume 1.5.0).
>>> Regards.
>>> Seba
>>>
>>> On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:
>>>
>>>> I see that you have the settings below set to zero. You don't want
>>>> rolling to HDFS to happen based on any of size, count, or time
>>>> interval?
>>>>
>>>> test.sinks.s1.hdfs.rollSize = 0
>>>> test.sinks.s1.hdfs.rollCount = 0
>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>
>>>> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <
>>>> sebastiano.dipa...@gmail.com> wrote:
>>>>
>>>>> Hi Paul,
>>>>> thanks for your answer.
>>>>> As I'm a newbie to Flume, how can I attach multiple sinks to the same
>>>>> channel? (Do they read data from the memory channel in a round-robin
>>>>> fashion?)
>>>>> (Does this create multiple files on HDFS? That is not what I'm
>>>>> expecting: I have a 500 MB data file at the source and I would like
>>>>> to have only one file on HDFS.)
>>>>>
>>>>> I can't believe that I cannot achieve such performance with a single
>>>>> sink. I'm pretty sure it's a configuration issue!
>>>>> Besides this, how should I tune the batchSize parameter? (Of course I
>>>>> have already tried setting it to 10 times the number in my config,
>>>>> but no relevant improvement.)
>>>>> Regards.
>>>>> Seba
>>>>>
>>>>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:
>>>>>
>>>>>> Start adding additional HDFS sinks attached to the same channel.
>>>>>> You can also tune batch sizes when writing to HDFS to increase
>>>>>> per-sink performance.
>>>>>>
>>>>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <
>>>>>> sebastiano.dipa...@gmail.com> wrote:
>>>>>>
>>>>>> Hi there,
>>>>>> I'm a complete newbie to Flume, so I probably made a mistake in my
>>>>>> configuration, but I cannot pin it down.
>>>>>> I want to achieve maximum transfer performance.
>>>>>> My Flume machine has 16 GB RAM and 8 cores.
>>>>>> I'm using a very simple Flume architecture:
>>>>>> Source -> Memory Channel -> Sink
>>>>>> The source is of type netcat and the sink is hdfs.
>>>>>> The machine has 1 Gb Ethernet directly connected to the switch of
>>>>>> the Hadoop cluster.
>>>>>> The point is that Flume is very slow loading the data into my HDFS
>>>>>> filesystem.
>>>>>> (i.e. using "hdfs dfs -copyFromLocal myfile /flume/events/myfile"
>>>>>> from the same machine I reach approx. 250 Mb/s as transfer rate,
>>>>>> while transferring the same file with this Flume architecture is
>>>>>> more like 2-3 Mb/s.)
>>>>>> (The cluster is composed of 10 machines and was totally idle while
>>>>>> I did this test, so it was not under stress. The traffic rate was
>>>>>> measured on the Flume machine's output interface in both
>>>>>> experiments.)
>>>>>> (myfile has 10 million lines with an average size of 150 bytes
>>>>>> each.)
>>>>>>
>>>>>> From what I understand so far, it doesn't seem to be a source
>>>>>> issue: the memory channel tends to fill up if I decrease the
>>>>>> channel capacity (but even making it very, very big does not affect
>>>>>> sink performance), so it seems to me that the problem is related to
>>>>>> the sink.
>>>>>> To test this point I also tried changing the source to the "exec"
>>>>>> type, simply executing "cat myfile", but the result hasn't
>>>>>> changed....
>>>>>>
>>>>>> Here's the config I used...
>>>>>>
>>>>>> # list the sources, sinks and channels for the agent
>>>>>> test.sources = r1
>>>>>> test.channels = c1
>>>>>> test.sinks = s1
>>>>>>
>>>>>> # exec attempt
>>>>>> test.sources.r1.type = exec
>>>>>> test.sources.r1.command = cat /tmp/myfile
>>>>>>
>>>>>> # my netcat attempt
>>>>>> #test.sources.r1.type = netcat
>>>>>> #test.sources.r1.bind = localhost
>>>>>> #test.sources.r1.port = 6666
>>>>>>
>>>>>> # my file channel attempt
>>>>>> #test.channels.c1.type = file
>>>>>>
>>>>>> # my memory channel attempt
>>>>>> test.channels.c1.type = memory
>>>>>> test.channels.c1.capacity = 1000000
>>>>>> test.channels.c1.transactionCapacity = 10000
>>>>>>
>>>>>> # how do I properly set these parameters? even if I enable them,
>>>>>> # nothing changes in my performance (what is the buffer percentage
>>>>>> # used for?)
>>>>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>>>>> #test.channels.c1.byteCapacity = 100000000
>>>>>>
>>>>>> # set channel for source
>>>>>> test.sources.r1.channels = c1
>>>>>> # set channel for sink
>>>>>> test.sinks.s1.channel = c1
>>>>>>
>>>>>> test.sinks.s1.type = hdfs
>>>>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>>>>
>>>>>> test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>>>>> test.sinks.s1.hdfs.filePrefix = log-data
>>>>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>>>>
>>>>>> # how do I set this parameter? (I basically want to send as much
>>>>>> # data as I can)
>>>>>> test.sinks.s1.hdfs.batchSize = 10000
>>>>>>
>>>>>> #test.sinks.s1.hdfs.round = true
>>>>>> #test.sinks.s1.hdfs.roundValue = 5
>>>>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>>>>
>>>>>> test.sinks.s1.hdfs.rollSize = 0
>>>>>> test.sinks.s1.hdfs.rollCount = 0
>>>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>>>
>>>>>> # compression attempt
>>>>>> #test.sinks.s1.hdfs.fileType = CompressedStream
>>>>>> #test.sinks.s1.hdfs.codeC = gzip
>>>>>> #test.sinks.s1.hdfs.codeC = BZip2Codec
>>>>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>>>>
>>>>>> Can someone show me how to find this bottleneck / configuration
>>>>>> mistake? (I can't believe this is Flume's performance on my
>>>>>> machine.)
>>>>>>
>>>>>> Thanks a lot if you can help me.
>>>>>> Regards.
>>>>>> Sebastiano
>>>>>
>>>>
>>>> --
>>>> Thanks and regards
>>>> Sandeep Khurana
>>>
>>
>> --
>> Thanks and regards
>> Sandeep Khurana
>
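Pulling the thread's two suggestions together — additional HDFS sinks on the same channel (Paul) and a much larger batchSize plus more heap (Sandeep, which Sebastiano confirmed helped) — a sketch of what the tuned agent config might look like. The second sink `s2`, its file prefix, and the exact capacity/batch numbers are illustrative assumptions, not values confirmed in the thread:

```properties
# Sketch only: two HDFS sinks draining the same memory channel. Each sink
# runs in its own thread and takes batches from c1 independently, so writes
# proceed in parallel -- but note each sink writes its OWN file on HDFS,
# so a single 500 MB input lands as two output files.
test.sources = r1
test.channels = c1
test.sinks = s1 s2

test.sources.r1.type = exec
test.sources.r1.command = cat /tmp/myfile
test.sources.r1.channels = c1

test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
# transactionCapacity must be >= each sink's hdfs.batchSize
test.channels.c1.transactionCapacity = 1000000

test.sinks.s1.channel = c1
test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.useLocalTimeStamp = true
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-1
# 100x the original 10000, matching "raised batchSize by a factor of 100";
# at ~150 bytes/event that is ~150 MB in flight per batch, hence the extra
# heap the thread mentions.
test.sinks.s1.hdfs.batchSize = 1000000
test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
# roll every 300 seconds, as in Sebastiano's test
test.sinks.s1.hdfs.rollInterval = 300

# s2 mirrors s1 with a distinct file prefix (illustrative)
test.sinks.s2.channel = c1
test.sinks.s2.type = hdfs
test.sinks.s2.hdfs.useLocalTimeStamp = true
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-2
test.sinks.s2.hdfs.batchSize = 1000000
test.sinks.s2.hdfs.rollSize = 0
test.sinks.s2.hdfs.rollCount = 0
test.sinks.s2.hdfs.rollInterval = 300
```

The extra heap itself is set outside this file, typically via `JAVA_OPTS` in `flume-env.sh` (an assumed value, e.g. `-Xmx4g`, sized to hold the larger in-flight batches).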