Hi,

On Wed, Aug 6, 2014 at 5:04 AM, Ashish <paliwalash...@gmail.com> wrote:

> Sharing some random thoughts
>
> 1. Download the file using S3 SDK and let the SpoolDirectory
> implementation take care of rest. Like a Decorator in front of
> SpoolDirectory
>

My worry is that using SpoolDirectory requires temporary writes to the
filesystem, and if you are using Flume to process a lot of data, writing
large amounts of data to disk will slow things down quite a bit.

But maybe there is no way of avoiding disk anyway because of Flume's
checkpointing and other parts that write to disk already?
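For reference, the shape of Ashish's option 1 might look like this (Python for brevity; the `fetch` callable is a stand-in for an S3 SDK download call, and all names here are illustrative, not a proposal):

```python
import os
import tempfile

def spool_s3_object(fetch, key, spool_dir):
    """Download one S3 object and move it into the spool directory.

    `fetch(key)` is a placeholder for an S3 SDK download call and must
    return the object's bytes. Writing to a temp file and renaming it
    into place keeps half-written files from being picked up; a real
    setup should also configure the spool source to ignore *.tmp files.
    """
    data = fetch(key)
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Flatten the key into a file name the spool source can track.
    final_path = os.path.join(spool_dir, key.replace("/", "_"))
    os.rename(tmp_path, final_path)  # atomic on POSIX, same filesystem
    return final_path
```

The "decorator" aspect is just scheduling something like this ahead of an unmodified SpoolingDirectorySource, which then handles events and cleanup as usual.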

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> 2. Use the S3 SDK to create an InputStream over S3 objects directly in
> code and create events from it.
>
> It would be great to reuse an existing InputStream-based implementation
> and feed it the S3 object's input stream; the concern about metadata
> storage still remains. Most often S3 objects are stored in compressed
> form, so this source would need to handle compression (gz/avro/others).
>
> Best is to start with something that works and then start adding more
> features to it.
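A minimal sketch of the decompression concern in point 2, handling gzip only (avro and other formats would need their own readers); `body` stands in for the S3 object's streaming body, which the SDK exposes as a file-like object:

```python
import gzip

def open_object_stream(body, key):
    """Wrap a raw S3 object stream, decompressing based on the key suffix.

    `body` is any file-like object positioned at the start of the data.
    Only gzip is handled in this sketch.
    """
    if key.endswith(".gz"):
        return gzip.GzipFile(fileobj=body)
    return body
```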
>
>
> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Hi all,
>>
>> I started trying to write some code on this, and realized there are a
>> number of issues that need to be discussed in order to really design this
>> feature effectively. The requirements that have been discussed thus far are:
>>
>> 1. Fetching data from S3 periodically
>> 2. Fetching data from multiple S3 buckets -- This may be something that
>> should be punted on until later. For a first implementation, this could be
>> solved just by having multiple sources, each with a single S3 bucket
>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>> clarify what you mean by this?*
>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>> so I think this is out-of-scope for discussions at the moment
>>
>> Some questions I want to try to answer:
>>
>> 1. How do we identify and track objects that need to be processed versus
>> objects that have been processed already?
>> 1a. What if we want multiple sources working against the same bucket to
>> speed up processing?
>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?
>>
>> For the first question, if we ignore the multiple source extension of
>> the question, I think the simplest answer is to do something on the local
>> filesystem, like have a tracking directory that contains a list of
>> to-be-processed objects and a list of already-processed objects. However,
>> if the source goes down, what should the restart semantics be? It seems
>> that the ideal situation is to store this state in a system like ZooKeeper,
>> which would ensure that a number of sources could operate off of the same
>> bucket, but this probably requires FLUME-1491 first.
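The local-filesystem tracking idea could look something like this sketch (file name and layout are illustrative only): an append-only file of processed keys, diffed against each bucket listing. Restart semantics fall out of rereading the file, though a real source would also need to handle a crash between processing a key and recording it:

```python
import os

class ProcessedTracker:
    """Track processed S3 keys in a local file, one key per line."""

    def __init__(self, tracking_dir):
        os.makedirs(tracking_dir, exist_ok=True)
        self.path = os.path.join(tracking_dir, "processed.list")
        self.done = set()
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.done = {line.strip() for line in f if line.strip()}

    def pending(self, listed_keys):
        """Return keys from a bucket listing that still need processing."""
        return [k for k in listed_keys if k not in self.done]

    def mark_done(self, key):
        """Append a processed key; the file doubles as restart state."""
        with open(self.path, "a") as f:
            f.write(key + "\n")
        self.done.add(key)
```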
>>
>> For the second question, my feeling was just that we should work with
>> similar assumptions to how the SpoolingDirectorySource works, where each
>> line is a separate event. Does that seem reasonable?
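Under that assumption, event extraction is just line splitting over the object's stream; a sketch, with the event simplified to the line's bytes (real Flume events also carry headers):

```python
def events_from_stream(stream):
    """Yield one event body per line, mirroring the one-line-one-event
    convention of the SpoolingDirectorySource. `stream` is any file-like
    object that yields bytes lines."""
    for line in stream:
        yield line.rstrip(b"\r\n")
```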
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>
>>> Hi,
>>> Thanks for explanation Jonathan. I think I will also start working on
>>> it. When you have any patch (even draft) I'd be glad if you can attach it
>>> in JIRA. I'll do the same.
>>> What do you think?
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>
>>> +1 on an S3 Source. I would gladly review.
>>>>
>>>> Jonathan Natkins wrote:
>>>>
>>>>
>>>> Hey Pawel,
>>>>
>>>> My intention is to start working on it, but I don't know exactly how
>>>> long it will take, and I'm not a committer, so time estimates would
>>>> have to be taken with a grain of salt regardless. If this is something
>>>> that you need urgently, it may not be ideal to wait for me, and you
>>>> may want to start building something yourself.
>>>>
>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>> be done by refreshing the configuration files across the set of Flume
>>>> agents. It's certainly not as great as having a single place to change
>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
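"Refreshing the configuration files" would mean rewriting and redistributing an agent properties file along these lines; the snippet is purely hypothetical, since the S3 source type and its parameter names don't exist yet:

```properties
# Hypothetical config -- the S3 source class and its properties are invented
agent1.sources = s3-1
agent1.channels = mem-1
agent1.sources.s3-1.type = org.example.flume.source.S3Source
agent1.sources.s3-1.bucket = my-log-bucket
agent1.sources.s3-1.pollIntervalSeconds = 60
agent1.sources.s3-1.channels = mem-1
agent1.channels.mem-1.type = memory
```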
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>
>>>>     Hi,
>>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>>     JIRA issue and want to start implementing this; do you have any
>>>>     estimate of how long it will take?
>>>>
>>>>     I think the biggest challenge here is to have dynamic
>>>>     configuration of Flume. It doesn't seem to be part of the
>>>>     FLUME-2437 issue. Am I right?
>>>>
>>>>     > Would you need to be able to pull files from multiple S3
>>>>     directories with the same source?
>>>>
>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>     source. I just imagine an approach where each S3 source can be
>>>>     added or deleted on demand and attached to any Channel. My only
>>>>     concern is the dynamic configuration. I'll open a new thread
>>>>     about it. It seems we have two totally separate things:
>>>>     * build an S3 source
>>>>     * make Flume configurable dynamically
>>>>
>>>>     --
>>>>     Paweł
>>>>
>>>>
>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>     <otis.gospodne...@gmail.com>:
>>>>
>>>>
>>>>         Hi,
>>>>
>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>         <na...@streamsets.com> wrote:
>>>>
>>>>             Hey all,
>>>>
>>>>             I created a JIRA for this:
>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>
>>>>
>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>         version?
>>>>
>>>>             I thought I'd start working on one myself, which can
>>>>             hopefully be contributed back. I'm curious: do you have
>>>>             particular requirements? Based on the emails in this
>>>>             thread, it sounds like the original goal was to have
>>>>             something that's like a SpoolDirectorySource that just
>>>>             picks up new files from S3. Is that accurate?
>>>>
>>>>
>>>>         Yes, I think so.  We need to be able to:
>>>>         * fetch data (logs to pull into Logsene
>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>         every 1 min, every 5 min, etc.)
>>>>         * fetch data from multiple S3 buckets
>>>>         * associate an S3 bucket with a user/token/key
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) add new S3 buckets from which data should be
>>>>         fetched
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>
>>>>
>>>>             Would you need to be able to pull files from multiple S3
>>>>             directories with the same source?
>>>>
>>>>
>>>>         I think the above addresses this question.
>>>>
>>>>             Thanks,
>>>>             Natty
>>>>
>>>>
>>>>         Thanks!
>>>>
>>>>         Otis
>>>>         --
>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>
>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>             <otis.gospodne...@gmail.com> wrote:
>>>>
>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>
>>>>                 But being able to dynamically add/remove S3 buckets
>>>>                 from which to pull data seems important.
>>>>
>>>>                 Any suggestions for how to approach that?
>>>>
>>>>                 Otis
>>>>                 --
>>>>                 Performance Monitoring * Log Analytics * Search Analytics
>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>                 <hshreedha...@cloudera.com> wrote:
>>>>
>>>>                     Please go ahead and file a jira. If you are
>>>>                     willing to submit a patch, you can post it on the
>>>>                     jira.
>>>>
>>>>                     Viral Bajaria wrote:
>>>>
>>>>
>>>>
>>>>                     I have a similar use case that cropped up
>>>>                     yesterday. I looked through the archive and found
>>>>                     a recommendation to build it, as Sharninder
>>>>                     suggested.
>>>>
>>>>                     For now, I went down the route of writing a
>>>>                     Python script which downloads from S3 and puts
>>>>                     the files in a directory configured to be picked
>>>>                     up via a spooldir.
>>>>
>>>>                     I would prefer to get a direct S3 source, and
>>>>                     maybe we could
>>>>                     collaborate on it and open-source it. Let me know
>>>>                     if you prefer that
>>>>                     and we can work directly on it by creating a JIRA.
>>>>
>>>>                     Thanks,
>>>>                     Viral
>>>>
>>>>
>>>>
>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>                     <hshreedha...@cloudera.com> wrote:
>>>>
>>>>                         In both cases, Sharninder is right :)
>>>>
>>>>                         Sharninder wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                         As far as I know, there is no (open source)
>>>>                     implementation of an S3 source, so yes, you'll
>>>>                     have to implement your own. You'll have to
>>>>                     implement a PollableSource, and the dev
>>>>                     documentation has an outline that you can use.
>>>>                     You can also look at the existing ExecSource and
>>>>                     work your way up.
>>>>
>>>>                         As far as I know, there is no way to
>>>>                     configure Flume without using the
>>>>                     configuration file.
>>>>
>>>>
>>>>
>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>                     <pro...@gmail.com> wrote:
>>>>
>>>>                             Hi,
>>>>                             I'm wondering if Flume is able to read
>>>>                     directly from S3.
>>>>
>>>>                             I'll describe my case. I have log files
>>>>                     stored in AWS S3. I have to periodically fetch
>>>>                     new S3 objects and read log lines from them.
>>>>                     The log lines (events) are then processed in the
>>>>                     standard Flume way (as with other sources).
>>>>
>>>>                             *1) Is there any way to fetch S3 objects,
>>>>                     or do I have to write my own Source?*
>>>>
>>>>
>>>>                             There is also a second case: I want the
>>>>                     Flume configuration to be dynamic. Flume sources
>>>>                     can change over time; new AWS keys and S3
>>>>                     buckets can be added or deleted.
>>>>
>>>>                             *2) Is there any way to configure Flume
>>>>                     other than by a static configuration file?*
>>>>
>>>>                             --
>>>>                             Paweł Róg
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
