Adding the dev list to the discussion
On Wed, Aug 6, 2014 at 9:37 AM, Jonathan Natkins <na...@streamsets.com> wrote:

> Ashish, I've put some comments inline.
>
> On Tuesday, August 5, 2014, Ashish <paliwalash...@gmail.com> wrote:
>
>> Sharing some random thoughts
>>
>> 1. Download the file using the S3 SDK and let the SpoolDirectory
>> implementation take care of the rest, like a Decorator in front of
>> SpoolDirectory.
>
> This works for the simple case, but I don't think this is an ideal
> solution. My primary concern is that S3's max object size is 5TB, so
> downloading the object to local disk may not be possible.
>
>> 2. Use the S3 SDK to create an InputStream over S3 objects directly in
>> code and create events out of it.
>>
>> It would be great to reuse an existing implementation that is based on an
>> InputStream and feed it the S3 object's input stream; the concern of
>> metadata storage still remains. S3 objects are most often stored in
>> compressed form, so this source would need to handle compression
>> (gz/avro/others).
>>
>> Best to start with something that works and then add more features to it.
>>
>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>
>>> Hi all,
>>>
>>> I started trying to write some code on this, and realized there are a
>>> number of issues that need to be discussed in order to really design this
>>> feature effectively. The requirements that have been discussed thus far are:
>>>
>>> 1. Fetching data from S3 periodically
>>> 2. Fetching data from multiple S3 buckets -- This may be something that
>>> should be punted on until later. For a first implementation, this could be
>>> solved just by having multiple sources, each with a single S3 bucket
>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>> clarify what you mean by this?*
>>> 4. Dynamically reconfiguring the source -- This is blocked by FLUME-1491,
>>> so I think this is out of scope for discussions at the moment
>>>
>>> Some questions I want to try to answer:
>>>
>>> 1. How do we identify and track objects that need to be processed versus
>>> objects that have been processed already?
>>> 1a. What if we want multiple sources working against the same bucket to
>>> speed up processing?
>>> 2. Is it fair to assume that we're dealing with character files rather
>>> than binary objects?
>>>
>>> For the first question, if we ignore the multiple-source extension of
>>> the question, I think the simplest answer is to do something on the local
>>> filesystem, like have a tracking directory that contains a list of
>>> to-be-processed objects and a list of already-processed objects. However,
>>> if the source goes down, what should the restart semantics be? It seems
>>> that the ideal situation is to store this state in a system like ZooKeeper,
>>> which would ensure that a number of sources could operate off of the same
>>> bucket, but this probably requires FLUME-1491 first.
>>>
>>> For the second question, my feeling was that we should work with
>>> assumptions similar to how the SpoolingDirectorySource works, where each
>>> line is a separate event. Does that seem reasonable?
>>>
>>> Thanks,
>>> Natty
>>>
>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Thanks for the explanation, Jonathan. I think I will also start working
>>>> on it. When you have any patch (even a draft) I'd be glad if you could
>>>> attach it in JIRA. I'll do the same. What do you think?
>>>>
>>>> --
>>>> Paweł Róg
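A minimal sketch of Ashish's option 2 -- streaming an S3 object and turning each line into a Flume event -- assuming the AWS SDK for Java (AmazonS3) and Flume's EventBuilder. The S3ObjectReader class and its method are illustrative, not an existing implementation:

// A sketch of option 2: stream an S3 object and emit one Flume event per
// line, mirroring the SpoolDirectorySource convention discussed above.
// Class and method names are placeholders.
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

import org.apache.flume.Event;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.event.EventBuilder;

public class S3ObjectReader {

  /** Streams one S3 object and emits each line as a Flume event. */
  public static void streamObject(AmazonS3 s3, ChannelProcessor processor,
                                  String bucket, String key) throws Exception {
    S3Object object = s3.getObject(bucket, key);
    InputStream in = object.getObjectContent();
    // Handle the common gzip case Ashish mentions; Avro and other container
    // formats would need their own deserializers.
    if (key.endsWith(".gz")) {
      in = new GZIPInputStream(in);
    }
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        Event event = EventBuilder.withBody(line, StandardCharsets.UTF_8);
        processor.processEvent(event);
      }
    }
  }
}

Because this never touches local disk, it sidesteps the 5TB-object concern raised against option 1.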
>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>>
>>>>> +1 on an S3 Source. I would gladly review.
>>>>>
>>>>> Jonathan Natkins wrote:
>>>>>
>>>>> Hey Pawel,
>>>>>
>>>>> My intention is to start working on it, but I don't know exactly how
>>>>> long it will take, and I'm not a committer, so time estimates would
>>>>> have to be taken with a grain of salt regardless. If this is something
>>>>> you need urgently, it may not be ideal to wait for me; you may want to
>>>>> start building something yourself.
>>>>>
>>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>>> be done by refreshing the configuration files across the set of Flume
>>>>> agents. It's certainly not as great as having a single place to change
>>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> Jonathan, how should we interpret your last e-mail? You opened a JIRA
>>>>> issue and want to start implementing this -- do you have any estimate
>>>>> of how long it will take?
>>>>>
>>>>> I think the biggest challenge here is to have dynamic configuration
>>>>> of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I
>>>>> right?
>>>>>
>>>>> > Would you need to be able to pull files from multiple S3
>>>>> > directories with the same source?
>>>>>
>>>>> I think we don't need to track multiple S3 buckets with a single
>>>>> source. I just imagine an approach where each S3 source can be added
>>>>> or deleted on demand and attached to any Channel. I'm only worried
>>>>> about this dynamic configuration. I'll open a new thread about it. It
>>>>> seems we have two totally separate things:
>>>>> * build an S3 source
>>>>> * make Flume configurable dynamically
>>>>>
>>>>> --
>>>>> Paweł
>>>>>
>>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> I created a JIRA for this:
>>>>> https://issues.apache.org/jira/browse/FLUME-2437
>>>>>
>>>>> Thanks! Should Fix Version be set to the next Flume release version?
>>>>>
>>>>> I thought I'd start working on one myself, which can hopefully be
>>>>> contributed back. I'm curious: do you have particular requirements?
>>>>> Based on the emails in this thread, it sounds like the original goal
>>>>> was to have something like a SpoolDirectorySource that just picks up
>>>>> new files from S3. Is that accurate?
>>>>>
>>>>> Yes, I think so. We need to be able to:
>>>>> * fetch data (logs, for pulling them into Logsene
>>>>> <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1
>>>>> min, every 5 min, etc.)
>>>>> * fetch data from multiple S3 buckets
>>>>> * associate an S3 bucket with a user/token/key
>>>>> * dynamically (i.e. without editing/writing config files stored on
>>>>> disk) add new S3 buckets from which data should be fetched
>>>>> * dynamically (i.e. without editing/writing config files stored on
>>>>> disk) stop fetching data from some S3 buckets
>>>>>
>>>>> Would you need to be able to pull files from multiple S3 directories
>>>>> with the same source?
>>>>>
>>>>> I think the above addresses this question.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support * http://sematext.com/
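Otis's requirement list maps naturally onto source properties. A hypothetical agent configuration for such a source might look like the following; the type and every property name are illustrative, since no such source exists at this point:

# Hypothetical configuration -- the S3Source class and all property
# names below are placeholders, not an existing Flume source.
agent.sources = s3src
agent.channels = memch

agent.sources.s3src.type = org.example.flume.source.S3Source
agent.sources.s3src.bucket = my-log-bucket
agent.sources.s3src.accessKeyId = <AWS access key>
agent.sources.s3src.secretAccessKey = <AWS secret key>
agent.sources.s3src.pollIntervalSec = 60
agent.sources.s3src.channels = memch

agent.channels.memch.type = memory

Note that dynamically adding or removing buckets would still mean editing this file and reloading it across agents, which is exactly the FLUME-1491 gap discussed above.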
>>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>
>>>>> +1 for an S3Source, starting with a JIRA issue.
>>>>>
>>>>> But being able to dynamically add/remove S3 buckets from which to
>>>>> pull data seems important.
>>>>>
>>>>> Any suggestions for how to approach that?
>>>>>
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>> <hshreedha...@cloudera.com> wrote:
>>>>>
>>>>> Please go ahead and file a JIRA. If you are willing to submit a
>>>>> patch, you can post it on the JIRA.
>>>>>
>>>>> Viral Bajaria wrote:
>>>>>
>>>>> I have a similar use case that cropped up yesterday. I saw the
>>>>> archive and found that there was a recommendation to build it as
>>>>> Sharninder suggested.
>>>>>
>>>>> For now, I went down the route of writing a Python script which
>>>>> downloads from S3 and puts the files in a directory that is
>>>>> configured to be picked up via a spooldir.
>>>>>
>>>>> I would prefer to have a direct S3 source, and maybe we could
>>>>> collaborate on it and open-source it. Let me know if you prefer
>>>>> that, and we can work directly on it by creating a JIRA.
>>>>>
>>>>> Thanks,
>>>>> Viral
>>>>>
>>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>> <hshreedha...@cloudera.com> wrote:
>>>>>
>>>>> In both cases, Sharninder is right :)
>>>>>
>>>>> Sharninder wrote:
>>>>>
>>>>> As far as I know, there is no (open source) implementation of an S3
>>>>> source, so yes, you'll have to implement your own. You'll have to
>>>>> implement a PollableSource, and the dev documentation has an outline
>>>>> that you can use. You can also look at the existing ExecSource and
>>>>> work your way up.
>>>>>
>>>>> As far as I know, there is no way to configure Flume without using
>>>>> the configuration file.
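The PollableSource outline Sharninder points to reduces to a skeleton along these lines; this is a sketch only, and the class name and the "bucket" property are placeholders:

// Skeleton of a custom PollableSource, per the dev documentation outline.
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.source.AbstractSource;

public class S3Source extends AbstractSource implements Configurable, PollableSource {

  private String bucket;

  @Override
  public void configure(Context context) {
    // Read settings from the static configuration file -- per Sharninder,
    // currently the only way to configure a Flume source.
    bucket = context.getString("bucket");
  }

  @Override
  public Status process() throws EventDeliveryException {
    try {
      // List objects not yet processed, stream each into events (see the
      // line-per-event sketch earlier in the thread), record completion.
      boolean foundNewObjects = false; // placeholder for real listing logic
      return foundNewObjects ? Status.READY : Status.BACKOFF;
    } catch (Exception e) {
      throw new EventDeliveryException("S3 poll failed", e);
    }
  }
}

Flume calls process() in a loop; returning BACKOFF when nothing new is found gives the periodic-polling behavior requested earlier in the thread.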
>>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> I'm wondering if Flume is able to read directly from S3.
>>>>>
>>>>> I'll describe my case. I have log files stored in AWS S3. I have to
>>>>> periodically fetch new S3 objects and read log lines from them. Then
>>>>> the log lines (events) are processed in the standard Flume way (as
>>>>> with other sources).
>>>>>
>>>>> *1) Is there any way to fetch S3 objects, or do I have to write my
>>>>> own Source?*
>>>>>
>>>>> There is also a second case. I want to have a dynamic Flume
>>>>> configuration. Flume sources can change over time; new AWS keys and
>>>>> S3 buckets can be added or deleted.
>>>>>
>>>>> *2) Is there any other way to configure Flume than by a static
>>>>> configuration file?*
>>>>>
>>>>> --
>>>>> Paweł Róg
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
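As a footnote to Viral's interim workaround above (download to a directory watched by a spooldir source), a rough Java equivalent under the same assumptions -- placeholder bucket and paths, the SDK's default credential chain, and listing pagination omitted for brevity:

// Rough Java equivalent of Viral's workaround: copy new S3 objects into a
// directory watched by a SpoolDirectorySource. Names are placeholders.
import java.io.File;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3ToSpoolDir {
  public static void main(String[] args) {
    AmazonS3Client s3 = new AmazonS3Client(); // default credential chain
    String bucket = "my-log-bucket";          // placeholder
    File spoolDir = new File("/var/spool/flume");
    File tmpDir = new File("/var/spool/flume-tmp");

    ObjectListing listing = s3.listObjects(bucket);
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
      String name = summary.getKey().replace('/', '_');
      File target = new File(spoolDir, name);
      if (target.exists()) {
        continue; // crude "already downloaded" check
      }
      // Download to a temp location first, then rename into place.
      File tmp = new File(tmpDir, name);
      s3.getObject(new GetObjectRequest(bucket, summary.getKey()), tmp);
      tmp.renameTo(target);
    }
  }
}

Downloading to a temporary directory and renaming into the spool directory matters because the spooling directory source expects files to be complete and immutable once they appear.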