Hi Gintas,

Do you happen to have a config file we could examine?
Based on the scenario you have described, I'm thinking of a "multi-agent flow"
(http://flume.apache.org/FlumeUserGuide.html#setting-multi-agent-flow) with a
Kafka source at the "start" of the first agent and an HDFS sink at the "end" of
the last agent. You can scale this system as you like (i.e. from a single node
to a whole cluster). The interceptor approach also sounds possible. Do you have
any performance concerns, or are you just looking for an elegant way to
implement a solution? To make this more concrete, I have put a rough two-agent
config sketch and an interceptor skeleton below the quoted mail.

Thanks,
Donat

2017-09-05 14:00 GMT+02:00 Gintautas Sulskus <gintautas.suls...@gmail.com>:

> Hi,
>
> I have a question regarding Flume's suitability for a particular use case.
>
> Task: There is a constant incoming stream of links that point to files.
> Those files are to be fetched and stored in HDFS.
>
> Desired implementation:
>
> 1. Each link to a file is stored in Kafka queue Q1.
> 2. Flume A1.source monitors Q1 for new links.
> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
>    is eventually stored in HDFS by A1.sink.
>
> My concern here is the seemingly overloaded functionality of A1.source.
> A1.source would have to perform two activities: 1) periodically poll
> queue Q1 for new links to files, and then 2) fetch those files.
>
> What do you think? Is there a cleaner way to achieve this, e.g. by using
> an interceptor to fetch files? Would this be appropriate?
>
> Best,
> Gintas
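
Here is the config sketch I mentioned above, just to show the shape of the
multi-agent flow: a1 reads the links from Kafka and forwards them over Avro
RPC, a2 receives them and writes to HDFS. All host names, ports, the topic
name and the HDFS path are placeholders I made up, not something taken from
your mail, so adjust them to your setup.

  # a1: Kafka source -> file channel -> Avro sink
  a1.sources = kafka-src
  a1.channels = ch1
  a1.sinks = avro-sink

  a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
  a1.sources.kafka-src.kafka.bootstrap.servers = kafka-broker:9092
  a1.sources.kafka-src.kafka.topics = Q1
  a1.sources.kafka-src.channels = ch1

  a1.channels.ch1.type = file

  a1.sinks.avro-sink.type = avro
  a1.sinks.avro-sink.hostname = agent2-host
  a1.sinks.avro-sink.port = 4141
  a1.sinks.avro-sink.channel = ch1

  # a2: Avro source -> file channel -> HDFS sink
  a2.sources = avro-src
  a2.channels = ch2
  a2.sinks = hdfs-sink

  a2.sources.avro-src.type = avro
  a2.sources.avro-src.bind = 0.0.0.0
  a2.sources.avro-src.port = 4141
  a2.sources.avro-src.channels = ch2

  a2.channels.ch2.type = file

  a2.sinks.hdfs-sink.type = hdfs
  a2.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/files/%Y-%m-%d
  a2.sinks.hdfs-sink.hdfs.fileType = DataStream
  a2.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
  a2.sinks.hdfs-sink.channel = ch2

Note that in this layout the events travelling through Flume are still just
the links; the actual file content only appears once something fetches it,
which is exactly your question of where to put that logic (a custom source
vs. an interceptor).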
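
As for the interceptor idea: an interceptor sits between the source and the
channel, so one way to keep the stock Kafka source untouched is to let the
interceptor replace the event body (the link) with the fetched file content.
A very rough skeleton, just to show the shape of it (the class name and the
plain URL-based download are my own placeholders, only the Interceptor and
Interceptor.Builder interfaces come from Flume), could look like this:

  import java.io.ByteArrayOutputStream;
  import java.io.InputStream;
  import java.net.URL;
  import java.nio.charset.StandardCharsets;
  import java.util.List;

  import org.apache.flume.Context;
  import org.apache.flume.Event;
  import org.apache.flume.interceptor.Interceptor;

  public class FetchFileInterceptor implements Interceptor {

    @Override
    public void initialize() {
      // nothing to set up in this sketch
    }

    @Override
    public Event intercept(Event event) {
      // The event body is assumed to hold a single link to a file.
      String link = new String(event.getBody(), StandardCharsets.UTF_8).trim();
      try (InputStream in = new URL(link).openStream()) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
          buf.write(chunk, 0, n);
        }
        // Replace the link with the file content; the HDFS sink then
        // writes the content rather than the link.
        event.setBody(buf.toByteArray());
      } catch (Exception e) {
        // Policy decision: the event passes through unchanged here so the
        // failure stays visible downstream; returning null would drop it.
      }
      return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
      for (Event e : events) {
        intercept(e);
      }
      return events;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
      @Override
      public Interceptor build() {
        return new FetchFileInterceptor();
      }

      @Override
      public void configure(Context context) {
        // no interceptor-specific properties in this sketch
      }
    }
  }

Keep in mind that downloads inside an interceptor block the source's thread,
and that each fetched file ends up as a single Flume event in the channel, so
whether this is appropriate really depends on your file sizes and on the
performance requirements I asked about above.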