What was your final configuration? What speed did you get?

2014-09-03 11:18 GMT+02:00 Sebastiano Di Paola <sebastiano.dipa...@gmail.com>:

> I raised batchSize by a factor of 100, added more heap space, and the
> speed increased...
> still not the same speed as "hdfs dfs -copyFromLocal", but I'm pretty
> sure it's a tuning problem.
> Thanks a lot for your hint.
> Regards
> Seba
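>
> (Roughly, the kind of change that means against the config further down
> in the thread; the exact values here are only illustrative:)
>
> # hdfs.batchSize raised by a factor of 100 (10000 -> 1000000)
> test.sinks.s1.hdfs.batchSize = 1000000
> # the channel must be able to hand over a full batch per transaction
> test.channels.c1.transactionCapacity = 1000000
> # plus more heap for the agent JVM, e.g. in conf/flume-env.sh:
> # export JAVA_OPTS="-Xms1g -Xmx8g"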
>
>
> On Wed, Sep 3, 2014 at 9:55 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:
>
>> Since you mentioned each record has an average size of 150 bytes, I
>> would try increasing the batch size to a higher value.
>>
>>
>> "HDFS batch size determines the number of events to take from the
>> channel and send in one go."
>>
>> So in one shot you are sending 10,000 events x 150 bytes = 1,500,000
>> bytes (about 1.5 MB) to HDFS.
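>>
>> (As a sketch, with an assumed value:)
>>
>> # 100,000 events x 150 bytes ~= 15 MB per write to HDFS
>> test.sinks.s1.hdfs.batchSize = 100000
>> # keep the channel transaction at least as large as the batch
>> test.channels.c1.transactionCapacity = 100000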
>>
>>
>> On Wed, Sep 3, 2014 at 1:18 PM, Sebastiano Di Paola <sebastiano.dipa...@gmail.com> wrote:
>>
>>> In my experiment I just want to transfer a single file, to test what
>>> performance I can achieve, so rolling files on HDFS is not vital at
>>> this point.
>>> Anyway, I did some tests rolling the file every 300 seconds.
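>>>
>>> (That test amounted to something like the following, assuming the same
>>> agent and sink names as in the config below:)
>>>
>>> # roll a new HDFS file every 300 seconds, and on nothing else
>>> test.sinks.s1.hdfs.rollInterval = 300
>>> test.sinks.s1.hdfs.rollSize = 0
>>> test.sinks.s1.hdfs.rollCount = 0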
>>> What I can't explain to myself is the "slow" output from the sink: the
>>> memory channel overflows if it's not big enough, so it seems that the
>>> source produces a higher data rate than the sink is able to process
>>> and send to HDFS.
>>> I'm not sure if it helps to pinpoint my "configuration mistake", but
>>> I'm using Flume 1.5.0.1 (I also tried Flume 1.5.0).
>>> Regards.
>>> Seba
>>>
>>>
>>> On Wed, Sep 3, 2014 at 9:38 AM, Sandeep Khurana <skhurana...@gmail.com> wrote:
>>>
>>>> I see that you have the settings below set to zero. You don't want
>>>> rolling to HDFS to happen based on any of the size, count, or time
>>>> interval triggers?
>>>>
>>>> test.sinks.s1.hdfs.rollSize = 0
>>>> test.sinks.s1.hdfs.rollCount = 0
>>>> test.sinks.s1.hdfs.rollInterval = 0
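>>>> # (0 disables each trigger: no roll on bytes written, on event
>>>> # count, or on elapsed seconds; the sink keeps a single file open
>>>> # until the agent stops)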
>>>>
>>>>
>>>> On Wed, Sep 3, 2014 at 1:06 PM, Sebastiano Di Paola <sebastiano.dipa...@gmail.com> wrote:
>>>>
>>>>> Hi Paul,
>>>>> thanks for your answer.
>>>>> As I'm a newbie to Flume, how can I attach multiple sinks to the same
>>>>> channel? (Do they read data from the memory channel in a round-robin
>>>>> fashion?)
>>>>> (Does this create multiple files on HDFS? Because that is not what
>>>>> I'm expecting: I have a 500MB data file at the source and I would
>>>>> like to end up with only one file on HDFS.)
>>>>>
>>>>> I can't believe that I cannot achieve such performance with a single
>>>>> sink. I'm pretty sure it's a configuration issue!
>>>>> Besides this, how should I tune the batchSize parameter? (Of course I
>>>>> have already tried setting it to 10 times the value in my config, but
>>>>> saw no relevant improvement.)
>>>>> Regards.
>>>>> Seba
>>>>>
>>>>>
>>>>> On Wed, Sep 3, 2014 at 9:11 AM, Paul Chavez <pcha...@ntent.com> wrote:
>>>>>
>>>>>>  Start adding additional HDFS sinks attached to the same channel.
>>>>>> You can also tune the batch sizes used when writing to HDFS to
>>>>>> increase per-sink performance.
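>>>>>>
>>>>>> (A minimal sketch of that, reusing the agent and channel names from
>>>>>> the config below. Each sink drains its own batches from the channel,
>>>>>> so events are split between the sinks rather than duplicated, and
>>>>>> each sink writes its own files:)
>>>>>>
>>>>>> # two HDFS sinks attached to the same channel
>>>>>> test.sinks = s1 s2
>>>>>> test.sinks.s1.channel = c1
>>>>>> test.sinks.s2.channel = c1
>>>>>> test.sinks.s1.type = hdfs
>>>>>> test.sinks.s2.type = hdfs
>>>>>> # distinct prefixes so the sinks don't collide on file names
>>>>>> test.sinks.s1.hdfs.filePrefix = log-data-1
>>>>>> test.sinks.s2.hdfs.filePrefix = log-data-2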
>>>>>>
>>>>>> On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola" <sebastiano.dipa...@gmail.com> wrote:
>>>>>>
>>>>>>   Hi there,
>>>>>> I'm a complete newbie to Flume, so I have probably made a mistake in
>>>>>> my configuration, but I cannot pinpoint it.
>>>>>> I want to achieve maximum transfer performance.
>>>>>> My Flume machine has 16GB RAM and 8 cores.
>>>>>> I'm using a very simple Flume architecture:
>>>>>> Source -> Memory Channel -> Sink
>>>>>> The source is of type netcat and the sink is hdfs.
>>>>>> The machine has a 1Gb Ethernet link directly connected to the switch
>>>>>> of the Hadoop cluster.
>>>>>> The point is that Flume is very slow in loading the data into my
>>>>>> HDFS filesystem.
>>>>>> (Using "hdfs dfs -copyFromLocal myfile /flume/events/myfile" from
>>>>>> the same machine I reach approx 250 Mb/s, while transferring the
>>>>>> same file through this Flume architecture runs at 2-3 Mb/s. The
>>>>>> cluster is composed of 10 machines and was totally idle during the
>>>>>> test, so it was not under stress. The traffic rate was measured on
>>>>>> the Flume machine's output interface in both experiments. myfile has
>>>>>> 10 million lines with an average size of 150 bytes each.)
>>>>>>
>>>>>>  From what I have understood so far, it doesn't seem to be a source
>>>>>> issue, as the memory channel tends to fill up if I decrease the
>>>>>> channel capacity (and even making it very big does not affect sink
>>>>>> performance), so it seems to me that the problem is related to the
>>>>>> sink.
>>>>>> To test this point I also tried changing the source to the "exec"
>>>>>> type, simply executing "cat myfile", but the result didn't change.
>>>>>>
>>>>>>
>>>>>>  Here's the config I used...
>>>>>>
>>>>>>   # list the sources, sinks and channels for the agent
>>>>>> test.sources = r1
>>>>>> test.channels = c1
>>>>>>  test.sinks = s1
>>>>>>
>>>>>>  # exec attempt
>>>>>> test.sources.r1.type = exec
>>>>>> test.sources.r1.command = cat /tmp/myfile
>>>>>>
>>>>>>  # my netcat attempt
>>>>>> #test.sources.r1.type = netcat
>>>>>> #test.sources.r1.bind = localhost
>>>>>> #test.sources.r1.port = 6666
>>>>>>
>>>>>>  # my file channel attempt
>>>>>> #test.channels.c1.type = file
>>>>>>
>>>>>> #my memory channel attempt
>>>>>> test.channels.c1.type = memory
>>>>>> test.channels.c1.capacity = 1000000
>>>>>> test.channels.c1.transactionCapacity = 10000
>>>>>>
>>>>>>  # how do I properly set these parameters? even when I enable them,
>>>>>> # nothing changes in my performance (what is the buffer percentage
>>>>>> # used for?)
>>>>>> #test.channels.c1.byteCapacityBufferPercentage = 50
>>>>>> #test.channels.c1.byteCapacity = 100000000
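>>>>>> # (note: byteCapacity caps the total bytes of event bodies the
>>>>>> # channel may hold, defaulting to 80% of the JVM heap, and
>>>>>> # byteCapacityBufferPercentage reserves that share of byteCapacity
>>>>>> # for event-header overhead)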
>>>>>>
>>>>>>  # set channel for source
>>>>>> test.sources.r1.channels = c1
>>>>>> # set channel for sink
>>>>>> test.sinks.s1.channel = c1
>>>>>>
>>>>>>  test.sinks.s1.type = hdfs
>>>>>> test.sinks.s1.hdfs.useLocalTimeStamp = true
>>>>>>
>>>>>>  test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
>>>>>> test.sinks.s1.hdfs.filePrefix = log-data
>>>>>> test.sinks.s1.hdfs.inUseSuffix = .dat
>>>>>>
>>>>>>  # how should I set this parameter? (I basically want to send as
>>>>>> # much data as I can)
>>>>>> test.sinks.s1.hdfs.batchSize = 10000
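>>>>>> # (note: the sink takes up to batchSize events per channel
>>>>>> # transaction, so this value must not exceed the channel's
>>>>>> # transactionCapacity, which is 10000 above)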
>>>>>>
>>>>>> #test.sinks.s1.hdfs.round = true
>>>>>> #test.sinks.s1.hdfs.roundValue = 5
>>>>>> #test.sinks.s1.hdfs.roundUnit = minute
>>>>>>
>>>>>> test.sinks.s1.hdfs.rollSize = 0
>>>>>> test.sinks.s1.hdfs.rollCount = 0
>>>>>> test.sinks.s1.hdfs.rollInterval = 0
>>>>>>
>>>>>> # compression attempt
>>>>>> #test.sinks.s1.hdfs.fileType = CompressedStream
>>>>>> #test.sinks.s1.hdfs.codeC=gzip
>>>>>> #test.sinks.s1.hdfs.codeC=BZip2Codec
>>>>>> #test.sinks.s1.hdfs.callTimeout = 120000
>>>>>>
>>>>>>  Can someone show me how to find this bottleneck / configuration
>>>>>> mistake? (I can't believe this is Flume's performance on my
>>>>>> machine.)
>>>>>>
>>>>>>  Thanks a lot if you can help me.
>>>>>> Regards.
>>>>>> Sebastiano
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks and regards
>>>> Sandeep Khurana
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks and regards
>> Sandeep Khurana
>>
>
>
