Suiter,

I had thought about the solution of using a cron job and hadoop commands. However, in my system there are already some sources (logsys, ...) that use Flume, so I prefer Flume for consistency and for its useful features (e.g. more sinks, roll count, ...).

Thanks,
Cuong LUU


On 25/10/2013 00:57, DSuiter RDX wrote:
Luu,

You might want to set up some redundant/load-balancing channels and sinks, so that if one sink is tied up, the operation can be attempted on another sink. I am not very experienced with that arrangement yet, so I cannot guide you very much, but I have seen it mentioned as a way to ensure delivery when there is too much going on. The source does not need to change, since it will replicate to all of its channels automatically, and each sink can read from its own channel.
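
For reference, a load-balancing sink group is configured roughly like this. This is just a sketch, untested, with made-up component names:

    # two HDFS sinks sharing the work through a sink group (names are illustrative)
    agent.sinks = hdfsSink1 hdfsSink2
    agent.sinkgroups = sg1
    agent.sinkgroups.sg1.sinks = hdfsSink1 hdfsSink2
    # load_balance spreads events across the sinks; round_robin alternates between them
    agent.sinkgroups.sg1.processor.type = load_balance
    agent.sinkgroups.sg1.processor.selector = round_robin
    # temporarily back off from a sink that fails
    agent.sinkgroups.sg1.processor.backoff = true

Each sink would still need its own channel and hdfs.path and so on, configured as usual.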

I'm not certain that Flume is a good way to handle such a large file; Flume seems designed to ingest many small files, aggregating them and so on.

But if the file you are uploading is somewhere on the local filesystem, can't you just use a cron entry to run "hadoop fs -put $FILE $HDFS/INPUT/PATH" to get it into HDFS?
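
For example, a crontab entry along these lines would push the file every hour (the schedule, paths, and hadoop binary location are all placeholders to adjust for your environment):

    # run at the top of every hour and copy the file into HDFS
    0 * * * * /usr/bin/hadoop fs -put /local-dir/myfile.txt /user/flume/input/

That keeps Flume out of the picture for the single large file.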

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Oct 24, 2013 at 11:35 AM, ltcuong211 <[email protected]> wrote:

    Hi Jeff & JS,

    I tried using the spooling dir source & memory channel. It still takes
    ~4 minutes to copy 1 GB of data into HDFS.
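
    For reference, a minimal spooling dir + memory channel setup looks
    roughly like this (not my exact config; the component names and paths
    here are only illustrative):

        # spooling directory source watching the local directory
        agent.sources = spoolSource
        agent.channels = memChannel
        agent.sources.spoolSource.type = spooldir
        agent.sources.spoolSource.spoolDir = /local-dir
        agent.sources.spoolSource.channels = memChannel
        # in-memory channel feeding the same HDFS sink as before
        agent.channels.memChannel.type = memory
        agent.channels.memChannel.capacity = 100000
        agent.channels.memChannel.transactionCapacity = 10000
        agent.sinks.hdfsSink.channel = memChannel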

    By the way, thanks for suggesting the spooling source. I think it is
    better than exec + cat in my case.

    Cuong LUU

    On 21/10/2013 22:50, Jeff Lord wrote:
    Luu,

    Have you tried using the spooling directory source?

    -Jeff


    On Mon, Oct 21, 2013 at 3:25 AM, Cuong Luu <[email protected]> wrote:

        Hi all,

        I need to copy data from a local directory (on the hadoop server)
        into HDFS regularly and automatically. This is my flume config:

        agent.sources = execSource
        agent.channels = fileChannel
        agent.sinks = hdfsSink

        agent.sources.execSource.type = exec

        agent.sources.execSource.shell = /bin/bash -c
        agent.sources.execSource.command = for i in /local-dir/*; do cat $i; done

        agent.sources.execSource.restart = true
        agent.sources.execSource.restartThrottle = 3600000
        agent.sources.execSource.batchSize = 100

        ...
        agent.sinks.hdfsSink.hdfs.rollInterval = 0
        agent.sinks.hdfsSink.hdfs.rollSize = 262144000
        agent.sinks.hdfsSink.hdfs.rollCount = 0
        agent.sinks.hdfsSink.hdfs.batchSize = 100000
        ...
        agent.channels.fileChannel.type = FILE
        agent.channels.fileChannel.capacity = 100000
        ...

        While the hadoop command takes about 30 seconds, Flume takes
        around 4 minutes to copy a 1 GB text file into HDFS. I am worried
        that my config is not good, or that maybe I shouldn't use Flume
        in this case.

        What is your opinion?




