Re: AWS S3 flume source

Paweł Fri, 01 Aug 2014 01:34:58 -0700

Hi,
Jonathan, how should we interpret your last e-mail? You opened a JIRA issue
and want to start implementing this. Do you have any estimate of how long
it will take?


I think the biggest challenge here is to have dynamic configuration of
Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?

> Would you need to be able to pull files from multiple S3 directories with
> the same source?

I don't think we need to track multiple S3 buckets with a single source. I
imagine an approach where each S3 source can be added or deleted on
demand and attached to any Channel. My only concern is this dynamic
configuration. I'll open a new thread about it. It seems we have two
totally separate things:
* build an S3 source
* make Flume dynamically configurable

--
Paweł


2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:

> Hi,
>
> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Hey all,
>>
>> I created a JIRA for this:
>> https://issues.apache.org/jira/browse/FLUME-2437
>>
>
> Thanks!  Should Fix Version be set to the next Flume release version?
>
>> I thought I'd start working on one myself, which can hopefully be
>> contributed back. I'm curious: do you have particular requirements? Based
>> on the emails in this thread, it sounds like the original goal was to have
>> something that's like a SpoolDirectorySource that just picks up new files
>> from S3. Is that accurate?
>>
>
> Yes, I think so.  We need to be able to:
> * fetch data (logs to pull into Logsene
> <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min,
> every 5 min, etc.)
> * fetch data from multiple S3 buckets
> * associate an S3 bucket with a user/token/key
> * dynamically (i.e. without editing/writing config files stored on disk)
> add new S3 buckets from which data should be fetched
> * dynamically (i.e. without editing/writing config files stored on disk)
> stop fetching data from some S3 buckets
>
>
>> Would you need to be able to pull files from multiple S3 directories with
>> the same source?
>>
>
> I think the above addresses this question.
>
>
>> Thanks,
>> Natty
>>
>
> Thanks!
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>>
>>
>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <
>> otis.gospodne...@gmail.com> wrote:
>>
>>> +1 for seeing S3Source, starting with a JIRA issue.
>>>
>>> But being able to dynamically add/remove S3 buckets from which to pull
>>> data seems important.
>>>
>>> Any suggestions for how to approach that?
>>>
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <
>>> hshreedha...@cloudera.com> wrote:
>>>
>>>> Please go ahead and file a jira. If you are willing to submit a patch,
>>>> you can post it on the jira.
>>>>
>>>> Viral Bajaria wrote:
>>>>
>>>>
>>>> I have a similar use case that cropped up yesterday. I saw the archive
>>>> and found that there was a recommendation to build it as Sharninder
>>>> suggested.
>>>>
>>>> For now, I went down the route of writing a Python script which
>>>> downloads from S3 and puts the files in a directory which is
>>>> configured to be picked up via a spooldir.
>>>>
>>>> I would prefer to get a direct S3 source, and maybe we could
>>>> collaborate on it and open-source it. Let me know if you prefer that
>>>> and we can work directly on it by creating a JIRA.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>> <hshreedha...@cloudera.com> wrote:
>>>>
>>>>     In both cases, Sharninder is right :)
>>>>
>>>>     Sharninder wrote:
>>>>
>>>>
>>>>
>>>>     As far as I know, there is no (open source) implementation of an S3
>>>>     source, so yes, you'll have to implement your own. You'll have to
>>>>     implement a PollableSource and the dev documentation has an outline
>>>>     that you can use. You can also look at the existing ExecSource and
>>>>     work your way up.
>>>>
>>>>     As far as I know, there is no way to configure Flume without
>>>>     using the configuration file.
>>>>
>>>>
>>>>
>>>>     On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:
>>>>
>>>>         Hi,
>>>>         I'm wondering if Flume is able to read directly from S3.
>>>>
>>>>         I'll describe my case. I have log files stored in AWS S3. I have
>>>>         to periodically fetch new S3 objects and read log lines from
>>>>         them. Then the log lines (events) are processed in the standard
>>>>         Flume way (as with other sources).
>>>>
>>>>         *1) Is there any way to fetch S3 objects, or do I have to write
>>>>         my own Source?*
>>>>
>>>>
>>>>         There is also a second case. I want the Flume configuration to
>>>>         be dynamic. Flume sources can change over time: new AWS keys and
>>>>         S3 buckets can be added or deleted.
>>>>
>>>>         *2) Is there any other way to configure Flume than by a static
>>>>         configuration file?*
>>>>
>>>>         --
>>>>         Paweł Róg
>>>>
>>>>
>>>>
>>>
>>
>
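
The workaround Viral describes in the thread (a script that downloads objects from S3 into a directory picked up by a spooling directory source) could look roughly like the Python sketch below. It is not Viral's script and not part of FLUME-2437. The function name `sync_new_objects`, the injected `list_keys` and `download` callables, and the staging directory are all assumptions made for illustration. The S3 calls are deliberately left abstract so the polling logic stands alone. Files are downloaded into a staging directory first and only then renamed into the spool directory, because the spooling directory source expects files not to change once they appear there.

```python
import os
from typing import Callable, Iterable, List, Set


def sync_new_objects(
    list_keys: Callable[[], Iterable[str]],
    download: Callable[[str, str], None],
    staging_dir: str,
    spool_dir: str,
    seen: Set[str],
) -> List[str]:
    """Deliver not-yet-seen objects into spool_dir; return the keys delivered.

    list_keys and download are hypothetical hooks: list_keys() yields object
    keys, download(key, dest_path) writes one object to a local file.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(spool_dir, exist_ok=True)
    delivered = []
    for key in sorted(list_keys()):
        if key in seen:
            continue
        # Flatten the key into one file name. This is a simplification:
        # keys such as "a/b" and "a_b" would collide.
        name = key.replace("/", "_")
        staged = os.path.join(staging_dir, name)
        download(key, staged)
        # Move the file into the spool directory only after the download
        # has finished, so the watcher never sees a partially written file.
        # staging_dir and spool_dir should be on the same filesystem.
        os.replace(staged, os.path.join(spool_dir, name))
        seen.add(key)
        delivered.append(key)
    return delivered
```

Calling this function from a loop with a sleep (or from cron) gives the "every 1 min, every 5 min" behaviour discussed above. With boto3, `list_keys` and `download` could be backed by the S3 client's `list_objects_v2` and `download_file` calls. The `seen` set lives only in memory here, so a real script would need to persist it between runs.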
