Re: AWS S3 flume source

Hari Shreedharan Fri, 01 Aug 2014 11:20:06 -0700

+1 on an S3 Source. I would gladly review.

Jonathan Natkins wrote:


Hey Pawel,

My intention is to start working on it, but I don't know exactly how
long it will take, and I'm not a committer, so time estimates would
have to be taken with a grain of salt regardless. If this is something
that you need urgently, it may be better to start building something
yourself rather than wait for me.

That said, as mentioned in the other thread, dynamic configuration can
be done by refreshing the configuration files across the set of Flume
agents. It's certainly not as great as having a single place to change
it (e.g. ZooKeeper), but it's a way to get the job done.

Thanks,
Natty


On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:

Hi,
Jonathan, how should we interpret your last e-mail? You opened a
JIRA issue and want to start implementing this? Do you have any
estimate of how long it will take?

I think the biggest challenge here is dynamic configuration of
Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I
right?

> Would you need to be able to pull files from multiple S3
directories with the same source?

I think we don't need to track multiple S3 buckets with a single
source. I imagine an approach where each S3 source can be
added or deleted on demand and attached to any Channel. My only
concern is the dynamic configuration. I'll open a new thread
about this. It seems we have two totally separate things:
* build an S3 source
* make Flume configurable dynamically

--
Paweł


2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:

Hi,

On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:

Hey all,

I created a JIRA for this:
https://issues.apache.org/jira/browse/FLUME-2437


Thanks! Should Fix Version be set to the next Flume release
version?

I thought I'd start working on one myself, which can
hopefully be contributed back. I'm curious: do you have
particular requirements? Based on the emails in this
thread, it sounds like the original goal was to have
something that's like a SpoolDirectorySource that just
picks up new files from S3. Is that accurate?


Yes, I think so. We need to be able to:
* fetch data (logs to be pulled into Logsene
<http://sematext.com/logsene/>) from S3 periodically (e.g.
every 1 min, every 5 min, etc.)
* fetch data from multiple S3 buckets
* associate an S3 bucket with a user/token/key
* dynamically (i.e. without editing/writing config files
stored on disk) add new S3 buckets from which data should be fetched
* dynamically (i.e. without editing/writing config files
stored on disk) stop fetching data from some S3 buckets
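
The requirements above could be modelled roughly as follows. This is only
a sketch to make the list concrete, not part of Flume or of FLUME-2437:
BucketConfig, BucketRegistry and all field names are hypothetical, and the
registry is held in memory with no locking or persistence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BucketConfig:
    """One S3 bucket to poll (all names here are hypothetical)."""
    bucket: str
    access_key: str   # credentials associated with this bucket
    secret_key: str
    interval_s: float  # poll period in seconds, e.g. 60 or 300
    last_poll: float = float("-inf")  # never polled yet


class BucketRegistry:
    """In-memory set of buckets that can change while the process runs."""

    def __init__(self) -> None:
        self._buckets: Dict[str, BucketConfig] = {}

    def add(self, cfg: BucketConfig) -> None:
        # Adding a bucket does not touch any config file on disk.
        self._buckets[cfg.bucket] = cfg

    def remove(self, bucket: str) -> None:
        # Removing an unknown bucket is a no-op.
        self._buckets.pop(bucket, None)

    def due(self, now: float) -> List[str]:
        """Return buckets whose interval has elapsed and mark them polled."""
        ready = [c for c in self._buckets.values()
                 if now - c.last_poll >= c.interval_s]
        for c in ready:
            c.last_poll = now
        return [c.bucket for c in ready]
```

A polling loop would call due() with the current time and fetch from each
returned bucket using that bucket's own credentials.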


Would you need to be able to pull files from multiple S3
directories with the same source?


I think the above addresses this question.

Thanks,
Natty


Thanks!

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

+1 for seeing S3Source, starting with a JIRA issue.

But being able to dynamically add/remove S3 buckets
from which to pull data seems important.

Any suggestions for how to approach that?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:

Please go ahead and file a jira. If you are
willing to submit a patch, you can post it on the
jira.

Viral Bajaria wrote:



I have a similar use case that cropped up
yesterday. I saw the archive
and found that there was a recommendation to
build it as Sharninder
suggested.

For now, I went down the route of writing a
python script which
downloads from S3 and puts the files in a
directory which is
configured to be picked up via a spooldir.
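
A minimal sketch of that kind of workaround, for illustration only (it is
not the script described above). The function name sync_to_spool and the
list_keys/fetch callables are hypothetical; in practice they would wrap an
S3 client, and that wiring is left out here. Each file is written to a
separate staging directory and then renamed into the spool directory, on
the assumption that the spooling directory source should only ever see
complete files. The staging directory is assumed to be on the same
filesystem as the spool directory so the rename is atomic.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Set


def sync_to_spool(list_keys: Callable[[], Iterable[str]],
                  fetch: Callable[[str], bytes],
                  spool_dir: str,
                  staging_dir: str,
                  seen: Set[str]) -> List[str]:
    """Copy objects not seen before into spool_dir; return the keys copied."""
    os.makedirs(spool_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)
    copied = []
    for key in list_keys():
        if key in seen:
            continue  # already handed to the spool directory
        data = fetch(key)
        # Flatten the object key into a single file name.
        name = key.replace("/", "_")
        # Write the full content outside the spool directory first.
        fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Move the finished file into place in one step.
        os.replace(tmp_path, os.path.join(spool_dir, name))
        seen.add(key)
        copied.append(key)
    return copied
```

Running this from cron or a loop with a sleep gives the periodic fetch;
the seen set would need to be persisted between runs to survive restarts.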

I would prefer to get a direct S3 source, and
maybe we could
collaborate on it and open-source it. Let me know
if you prefer that
and we can work directly on it by creating a JIRA.

Thanks,
Viral



On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:

In both cases, Sharninder is right :)

Sharninder wrote:




As far as I know, there is no (open source) implementation of an S3
source, so yes, you'll have to implement your own. You'll have to
implement a PollableSource, and the dev documentation has an outline
that you can use. You can also look at the existing ExecSource and
work your way up.

As far as I know, there is no way to configure Flume without using the
configuration file.



On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:

Hi,
I'm wondering if Flume is able to read
directly from S3.

I'll describe my case. I have log files stored in AWS S3. I have
to periodically fetch new S3 objects and read log lines from them.
Then the log lines (events) are processed in Flume's standard way
(as with other sources).

*1) Is there any way to fetch S3 objects, or do I have to write
my own Source?*


There is also a second case. I want Flume's configuration to be
dynamic. Flume sources can change over time: new AWS keys and S3
buckets can be added or deleted.

*2) Is there any other way to configure Flume than a static
configuration file?*

--
Paweł Róg
