Hi there,
I'm a complete newbie with Flume, so I've probably made a mistake in my
configuration, but I can't pinpoint it.
I want to achieve maximum transfer performance.
My Flume machine has 16 GB RAM and 8 cores.
I'm using a very simple Flume architecture:
Source -> Memory Channel -> Sink
Source is of type netcat
and Sink is hdfs
The machine has a 1 Gb Ethernet link directly connected to the switch of the
Hadoop cluster.
The point is that Flume is very slow at loading the data into my HDFS
filesystem.
(i.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from the
same machine I reach approx. 250 Mb/s as the transfer rate, while
transferring the same file through this Flume architecture runs at 2-3 Mb/s).
(The cluster is composed of 10 machines and was totally idle while I did
this test, so it was not under stress.) (The traffic rate was measured on the
Flume machine's output interface in both experiments.)
(myfile has 10 million lines with an average size of 150 bytes each.)
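To put rough numbers on the gap described above (a sketch, treating the two rates in the same unit; the absolute unit doesn't change the ratio between the runs):

```python
# Back-of-the-envelope check of the transfer-time gap described above.
# Assumption: both measured rates are treated here as megabytes/second;
# the ~100x slowdown ratio is the same whichever unit was actually meant.

lines = 10_000_000
avg_line_bytes = 150  # average size per line, per the description above

file_bytes = lines * avg_line_bytes   # total payload
file_mb = file_bytes / 1_000_000      # ~1500 MB

copy_rate = 250    # hdfs dfs -copyFromLocal baseline
flume_rate = 2.5   # observed through the Flume pipeline (2-3)

print(f"file size    : {file_mb:.0f} MB")
print(f"copyFromLocal: ~{file_mb / copy_rate:.0f} s")
print(f"via Flume    : ~{file_mb / flume_rate / 60:.0f} min")
print(f"slowdown     : ~{copy_rate / flume_rate:.0f}x")
```

So the same file goes from a few seconds to roughly ten minutes, which is why I suspect a configuration mistake rather than hardware or network limits.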

From what I understand so far, it doesn't seem to be a source issue: the
memory channel tends to fill up if I decrease the channel capacity (but even
making it very large does not affect sink performance), so it seems to me
that the problem is related to the sink.
To test this point I also tried changing the source to the "exec" type,
simply executing "cat myfile", but the result didn't change.
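In case anyone wants to reproduce this: a quick sketch of how a file like myfile could be generated. My real file's content doesn't matter; the line count and average length are what I described above, and the path, seed, and line format here are just placeholders.

```python
# Generate a synthetic test file: N lines averaging ~150 bytes each.
# The path and line content are placeholders, not my real data.
import random
import string

def write_test_file(path, lines=10_000_000, avg_len=150):
    rng = random.Random(42)  # fixed seed so runs are reproducible
    with open(path, "w") as f:
        for _ in range(lines):
            # vary each line's length a little around the 150-byte average
            n = avg_len + rng.randint(-20, 20)
            payload = "".join(rng.choices(string.ascii_lowercase, k=n - 1))
            f.write(payload + "\n")  # n bytes total including the newline

# write_test_file("/tmp/myfile")  # ~1.5 GB, takes a while to write
```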


Here's the config I'm using:

 # list the sources, sinks and channels for the agent
test.sources = r1
test.channels = c1
test.sinks = s1

# exec attempt
test.sources.r1.type = exec
test.sources.r1.command = cat /tmp/myfile

# my netcat attempt
#test.sources.r1.type = netcat
#test.sources.r1.bind = localhost
#test.sources.r1.port = 6666

# my file channel attempt
#test.channels.c1.type = file

#my memory channel attempt
test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000

# How do I set these parameters properly? Even if I enable them, nothing
# changes in my performance (what is the buffer percentage used for?)
#test.channels.c1.byteCapacityBufferPercentage = 50
#test.channels.c1.byteCapacity = 100000000

# set channel for source
test.sources.r1.channels = c1
# set channel for sink
test.sinks.s1.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.useLocalTimeStamp = true

test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data
test.sinks.s1.hdfs.inUseSuffix = .dat

# How do I set this parameter? (I basically want to send as much data as I can)
test.sinks.s1.hdfs.batchSize = 10000

#test.sinks.s1.hdfs.round = true
#test.sinks.s1.hdfs.roundValue = 5
#test.sinks.s1.hdfs.roundUnit = minute

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0

# compression attempt
#test.sinks.s1.hdfs.fileType = CompressedStream
#test.sinks.s1.hdfs.codeC=gzip
#test.sinks.s1.hdfs.codeC=BZip2Codec
#test.sinks.s1.hdfs.callTimeout = 120000

Can someone show me how to find this bottleneck / configuration mistake? (I
can't believe this is Flume's real performance on my machine.)

Thanks a lot if you can help me.
Regards.
Sebastiano
