Hi GIntas,

Do you happen to have a config file we could examine?

Based on the scenario you have described, I'm thinking of a
"multi-agent flow" (
http://flume.apache.org/FlumeUserGuide.html#setting-multi-agent-flow )
with a Kafka source at the "start" of the first agent and an HDFS sink
at the "end" of the last agent. You can scale this system as you like
(i.e. from a single node to a whole cluster).
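
Just to illustrate the idea, a rough (untested) two-agent sketch could
look something like the config below. The agent and component names,
hosts, Kafka topic and HDFS path are only placeholders:

# agent1: Kafka source -> file channel -> Avro sink (forwards to agent2)
agent1.sources = kafka-src
agent1.channels = ch1
agent1.sinks = avro-sink

agent1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-src.kafka.bootstrap.servers = kafkahost:9092
agent1.sources.kafka-src.kafka.topics = Q1
agent1.sources.kafka-src.channels = ch1

agent1.channels.ch1.type = file

agent1.sinks.avro-sink.type = avro
agent1.sinks.avro-sink.hostname = agent2host
agent1.sinks.avro-sink.port = 4545
agent1.sinks.avro-sink.channel = ch1

# agent2: Avro source -> file channel -> HDFS sink
agent2.sources = avro-src
agent2.channels = ch2
agent2.sinks = hdfs-sink

agent2.sources.avro-src.type = avro
agent2.sources.avro-src.bind = 0.0.0.0
agent2.sources.avro-src.port = 4545
agent2.sources.avro-src.channels = ch2

agent2.channels.ch2.type = file

agent2.sinks.hdfs-sink.type = hdfs
agent2.sinks.hdfs-sink.hdfs.path = /flume/files
agent2.sinks.hdfs-sink.hdfs.fileType = DataStream
agent2.sinks.hdfs-sink.channel = ch2

The same pattern extends to more hops by adding further Avro
sink/source pairs between agents.
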
The interceptor approach also sounds possible.
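
If you went down the interceptor route, something along these lines
might work. This is only a rough, untested sketch: the package name,
class name and error handling are made up, it just codes against the
public org.apache.flume.interceptor.Interceptor API.

// Hypothetical interceptor sketch: replaces the event body (a link)
// with the content of the file the link points to. Not production
// code: no timeouts, retries or size limits.
package com.example.flume;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class FetchFileInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // nothing to set up
  }

  @Override
  public Event intercept(Event event) {
    // The event body is expected to hold the link.
    String link = new String(event.getBody(), StandardCharsets.UTF_8).trim();
    try (InputStream in = new URL(link).openStream()) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      event.setBody(buffer.toByteArray());
      return event;
    } catch (Exception e) {
      // Dropping the event on failure; a real implementation should
      // handle this more carefully (logging, retry, dead-lettering).
      return null;
    }
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> fetched = new ArrayList<Event>(events.size());
    for (Event event : events) {
      Event result = intercept(event);
      if (result != null) {
        fetched.add(result);
      }
    }
    return fetched;
  }

  @Override
  public void close() {
    // nothing to clean up
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new FetchFileInterceptor();
    }

    @Override
    public void configure(Context context) {
      // no parameters in this sketch
    }
  }
}

You would then attach it to the Kafka source from the sketch above,
e.g.:

agent1.sources.kafka-src.interceptors = fetch
agent1.sources.kafka-src.interceptors.fetch.type = com.example.flume.FetchFileInterceptor$Builder

One thing to keep in mind with this approach is that each fetched file
then travels through the channel as a single event, so very large
files could become a concern.
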

Do you have any performance concerns, or are you just looking for an
elegant way to implement a solution?


Thanks,

Donat


2017-09-05 14:00 GMT+02:00 Gintautas Sulskus <gintautas.suls...@gmail.com>:
> Hi,
>
> I have a question regarding Flume suitability for a particular use case.
>
> Task: There is a constant incoming stream of links that point to files.
> Those files are to be fetched and stored in HDFS.
>
> Desired implementation:
>
> 1. Each link to a file is stored in Kafka queue Q1.
> 2. Flume A1.source monitors Q1 for new links.
> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
> is eventually stored in HDFS by A1.sink.
>
> My concern here is the seemingly overloaded functionality of A1.source. The
> A1.source would have to perform two activities: 1) periodically poll
> queue Q1 for new links to files and then 2) fetch those files.
>
> What do you think? Is there a cleaner way to achieve this, e.g. by using an
> interceptor to fetch files? Would this be appropriate?
>
> Best,
> GIntas
