Start by adding additional HDFS sinks attached to the same channel. You can also
tune the batch size used when writing to HDFS to increase per-sink performance.
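
An untested sketch based on the config quoted below (the s2/s3 names and the
-s2/-s3 prefixes are my own): each HDFS sink runs its events through its own
thread, so two or three sinks draining the same channel write to HDFS in
parallel. Give each sink its own filePrefix so the in-progress files don't
collide, and keep hdfs.batchSize at or below the channel's
transactionCapacity, since a sink takes at most one batch per channel
transaction.

test.sinks = s1 s2 s3

# second HDFS sink draining the same channel
test.sinks.s2.type = hdfs
test.sinks.s2.channel = c1
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-s2
test.sinks.s2.hdfs.batchSize = 10000
test.sinks.s2.hdfs.rollSize = 0
test.sinks.s2.hdfs.rollCount = 0
test.sinks.s2.hdfs.rollInterval = 0

# third sink, same idea
test.sinks.s3.type = hdfs
test.sinks.s3.channel = c1
test.sinks.s3.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s3.hdfs.filePrefix = log-data-s3
test.sinks.s3.hdfs.batchSize = 10000
test.sinks.s3.hdfs.rollSize = 0
test.sinks.s3.hdfs.rollCount = 0
test.sinks.s3.hdfs.rollInterval = 0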

On Sep 2, 2014, at 11:54 PM, "Sebastiano Di Paola"
<sebastiano.dipa...@gmail.com> wrote:

Hi there,
I'm a complete newbie with Flume, so I've probably made a mistake in my
configuration, but I cannot pinpoint it.
I want to achieve maximum transfer performance.
My Flume machine has 16 GB of RAM and 8 cores.
I'm using a very simple Flume architecture:
Source -> Memory Channel -> Sink
The source is of type netcat and the sink is hdfs.
The machine has a 1 Gb Ethernet link directly connected to the switch of the
Hadoop cluster.
The point is that Flume is very slow at loading the data into my HDFS
filesystem. Running hdfs dfs -copyFromLocal myfile /flume/events/myfile from
the same machine, I reach a transfer rate of approx. 250 Mb/s, while
transferring the same file through this Flume setup runs at 2-3 Mb/s. (The
cluster is composed of 10 machines and was totally idle while I ran this
test, so it was not under stress; the traffic rate was measured on the Flume
machine's output interface in both experiments. myfile has 10 million lines
with an average size of 150 bytes each.)

From what I have understood so far, it doesn't seem to be a source issue: the
memory channel tends to fill up if I decrease the channel capacity (and even
making it very, very big does not affect sink performance), so it seems to me
that the problem is related to the sink.
To test this point I also tried changing the source to the "exec" type,
simply executing "cat myfile", but the result didn't change.


Here's the config I used:

# list the sources, sinks and channels for the agent
test.sources = r1
test.channels = c1
test.sinks = s1

# exec attempt
test.sources.r1.type = exec
test.sources.r1.command = cat /tmp/myfile

# my netcat attempt
#test.sources.r1.type = netcat
#test.sources.r1.bind = localhost
#test.sources.r1.port = 6666

# my file channel attempt
#test.channels.c1.type = file

#my memory channel attempt
test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000

# How should these parameters be set? Even if I enable them, nothing changes
# in my performance. (What is the buffer percentage used for?)
#test.channels.c1.byteCapacityBufferPercentage = 50
#test.channels.c1.byteCapacity = 100000000

# set channel for source
test.sources.r1.channels = c1
# set channel for sink
test.sinks.s1.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.useLocalTimeStamp = true

test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data
test.sinks.s1.hdfs.inUseSuffix = .dat

# How should I set this parameter? (I basically want to send as much data as I can.)
test.sinks.s1.hdfs.batchSize = 10000

#test.sinks.s1.hdfs.round = true
#test.sinks.s1.hdfs.roundValue = 5
#test.sinks.s1.hdfs.roundUnit = minute

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0

# compression attempt
#test.sinks.s1.hdfs.fileType = CompressedStream
#test.sinks.s1.hdfs.codeC=gzip
#test.sinks.s1.hdfs.codeC=BZip2Codec
#test.sinks.s1.hdfs.callTimeout = 120000

Can someone show me how to find this bottleneck / configuration mistake? (I
can't believe this is Flume's real performance on my machine.)

Thanks a lot if you can help me.
Regards,
Sebastiano
