Hi GIntas,

What is the average (or expected maximum) size of the files you'd like to process? In general it is not recommended to transfer large events (i.e. >64 MB if you use the file channel, as this is a hard limit of the protobuf implementation). If your files fit within this limit, then I'd suggest using an interceptor to fetch the data, update the event's body, and push the event through Flume. A rough sketch of such an interceptor is below.
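For illustration only, this is a minimal, untested sketch of what such an interceptor could look like, assuming the Kafka message (the event body) contains the URL as a UTF-8 string. The class name, the "sourceUrl" header, and the drop-on-failure behaviour are just examples; error handling, timeouts and size checks are omitted:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class UrlFetchInterceptor implements Interceptor {

  @Override public void initialize() { }

  @Override
  public Event intercept(Event event) {
    // The event body coming from the Kafka source is assumed to be the link itself.
    String url = new String(event.getBody(), StandardCharsets.UTF_8).trim();
    try (InputStream in = new URL(url).openStream();
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      event.setBody(out.toByteArray());          // the fetched file becomes the event body
      event.getHeaders().put("sourceUrl", url);  // keep the original link as a header
      return event;
    } catch (Exception e) {
      return null;                               // drop the event on failure; adjust as needed
    }
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> result = new ArrayList<>(events.size());
    for (Event e : events) {
      Event intercepted = intercept(e);
      if (intercepted != null) {
        result.add(intercepted);
      }
    }
    return result;
  }

  @Override public void close() { }

  /** Builder needed so Flume can instantiate the interceptor from the agent config. */
  public static class Builder implements Interceptor.Builder {
    @Override public Interceptor build() { return new UrlFetchInterceptor(); }
    @Override public void configure(Context context) { }
  }
}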
In this case your setup would be: Kafka source + data fetcher interceptor (custom code) -> file channel (or memory channel) -> HDFS sink. A rough agent configuration for this setup is sketched below the quoted message.

If the files are larger, then you could use a customised HDFS sink which fetches the URL and stores the file on HDFS. In this case I'd recommend a Kafka channel -> custom HDFS sink setup, without configuring any source.

Actually, for your problem the sink-side interceptors would be a good solution (https://issues.apache.org/jira/browse/FLUME-2580), but unfortunately this is not implemented yet.

Regards,
Denes

On Tue, Sep 5, 2017 at 2:00 PM Gintautas Sulskus <gintautas.suls...@gmail.com> wrote:

> Hi,
>
> I have a question regarding Flume suitability for a particular use case.
>
> Task: There is an incoming constant stream of links that point to files.
> Those files are to be fetched and stored in HDFS.
>
> Desired implementation:
>
> 1. Each link to a file is stored in Kafka queue Q1.
> 2. Flume A1.source monitors Q1 for new links.
> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
> is eventually stored in HDFS by A1.sink.
>
> My concern here is the seemingly overloaded functionality of A1.source.
> The A1.source would have to perform two activities: 1) periodically poll
> queue Q1 for new links to files, and 2) fetch those files.
>
> What do you think? Is there a cleaner way to achieve this, e.g. by using
> an interceptor to fetch files? Would this be appropriate?
>
> Best,
> GIntas
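P.S. A rough agent configuration for the first setup (Kafka source + fetcher interceptor -> file channel -> HDFS sink) could look something like the following. The broker address, topic name, directories, HDFS path and roll settings are placeholders you'd adapt to your environment, and the interceptor class is the hypothetical one sketched above:

a1.sources = kafkaSrc
a1.channels = fileCh
a1.sinks = hdfsSink

a1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafkaSrc.kafka.bootstrap.servers = kafka-host:9092
a1.sources.kafkaSrc.kafka.topics = Q1
a1.sources.kafkaSrc.channels = fileCh
a1.sources.kafkaSrc.interceptors = fetch
a1.sources.kafkaSrc.interceptors.fetch.type = com.example.UrlFetchInterceptor$Builder

a1.channels.fileCh.type = file
a1.channels.fileCh.checkpointDir = /var/flume/checkpoint
a1.channels.fileCh.dataDirs = /var/flume/data

a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.channel = fileCh
a1.sinks.hdfsSink.hdfs.path = /flume/files/%Y-%m-%d
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.hdfs.rollInterval = 0
a1.sinks.hdfsSink.hdfs.rollSize = 0
a1.sinks.hdfsSink.hdfs.rollCount = 1

The rollCount = 1 / rollSize = 0 / rollInterval = 0 combination is just one possible choice: it would write one HDFS file per event, i.e. per fetched file, which may or may not be what you want depending on file sizes and volume.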