I was thinking the same. I think the store (DB, FS, ZK, something else) used to track state (what's been read from S3, what's been processed, etc.) would ideally be abstract/extensible.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
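A minimal sketch of the abstraction Otis describes (all names here are hypothetical, not existing Flume API); a file-, DB-, or ZooKeeper-backed implementation would sit behind the same interface:

    // Hypothetical pluggable store for per-object ingest state.
    public interface S3StateStore {
      /** True if the object has already been fully delivered to the channel. */
      boolean isProcessed(String bucket, String key);

      /** Record that the object has been fully delivered. */
      void markProcessed(String bucket, String key);

      /** Save a resume point (e.g. a byte offset) for a partially read object. */
      void saveOffset(String bucket, String key, long offset);

      /** Load the saved resume point, or 0 if none. */
      long loadOffset(String bucket, String key);
    }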
On Mon, Aug 11, 2014 at 9:33 AM, Ashish <paliwalash...@gmail.com> wrote:

> Maybe it's best not to depend on ZK directly: create some sort of abstraction which can use ZK, a DB, or some other mechanism to share the distributed state. How about keeping the distributed state out of the picture until we have a working S3 source, and plugging the metadata piece into it later? It can store the state locally, like the SpoolingDirectory source does.
>
> wdyt?
>
> On Mon, Aug 11, 2014 at 12:53 PM, Jonathan Natkins <na...@streamsets.com> wrote:
>
>> Yeah, I realize that. The reason I think it should be somewhat dependent upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight requirement just to use a particular source. FLUME-1491 would make Flume generally dependent upon ZooKeeper, which is a good transition point to start using ZK for other state that would be necessary for Flume components. Would you agree?
>>
>> On Sun, Aug 10, 2014 at 11:35 PM, Ashish <paliwalash...@gmail.com> wrote:
>>
>>> Seems like a bit of confusion here. FLUME-1491 only deals with the configuration part, nothing else. Even if it gets integrated, you would still need to write/expose an API to store metadata info in ZK (FLUME-1491 doesn't bring that in).
>>>
>>> HTH!
>>>
>>> On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>
>>>> Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, and would likely have to get rewritten once ZK integration happens anyway.
>>>>
>>>> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <pro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I don't think it is possible to simply use SpoolDirectorySource. It may be possible to reuse some elements of SpoolDirectory, but without touching its code I don't think SpoolDirectory is a good base. At the very beginning, SpoolDirectorySource does this:
>>>>>
>>>>> File directory = new File(spoolDirectory);
>>>>>
>>>>> ReliableSpoolingFileEventReader also instantiates the File class. There is also a question: how does ReliableSpoolingFileEventReader store information about files that have already been processed in non-deleting mode? What happens after a Flume restart?
>>>>>
>>>>> I agree with Jonathan that the S3 source should be able to store the last processed file, e.g. in ZooKeeper. Another thing, Jonathan: I think you shouldn't worry about multiple buckets being handled by a single S3Source. As you wrote, multiple sources is the solution here. I thought it was already discussed, but maybe I'm wrong.
>>>>>
>>>>> >> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>>>
>>>>> In my opinion the S3 source can by default read a file as simple text, but also take a configuration parameter with the class name of an "InputStream processor". This processor would be able to, e.g., unzip, deserialize Avro, or read JSON and convert it into log events. What do you think?
>>>>>
>>>>> --
>>>>> Paweł Róg
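A rough sketch of the pluggable "InputStream processor" Paweł proposes (the interface name and signature are assumptions, not existing Flume API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;
    import org.apache.flume.Event;

    // Hypothetical plugin, loaded by class name from the source configuration.
    // Implementations turn a raw S3 object stream into Flume events.
    public interface S3ObjectProcessor {
      List<Event> process(InputStream in) throws IOException;
    }

A plain-text implementation would split the stream on newlines; a gzip implementation would first wrap `in` in a java.util.zip.GZIPInputStream; an Avro implementation would hand it to a DataFileStream.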
>>>>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.baja...@gmail.com>:
>>>>>
>>>>>> I agree with the feedback provided by Ashish.
>>>>>>
>>>>>> I have started writing one which is similar to ExecSource, but I like the idea of doing something where spooldir takes over most of the hard work of spitting out events to sinks. Let me think more on how to structure that.
>>>>>>
>>>>>> Quick thinking out loud: I could create a source which extends the spooldir source and just spins off a thread to manage moving things from S3 to the spool directory via a temporary directory.
>>>>>>
>>>>>> Regarding maintaining metadata, there are two ways:
>>>>>> 1) DB: I currently maintain it in a database because there are a lot of other tools built around it.
>>>>>> 2) File: Just keep the info in memory and in a file, to help with crash recovery and/or high memory usage.
>>>>>>
>>>>>> Thanks,
>>>>>> Viral
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalash...@gmail.com> wrote:
>>>>>>
>>>>>>> Sharing some random thoughts:
>>>>>>>
>>>>>>> 1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest -- like a decorator in front of SpoolDirectory.
>>>>>>>
>>>>>>> 2. Use the S3 SDK to create an InputStream of S3 objects directly in code and create events out of it.
>>>>>>>
>>>>>>> It would be great to reuse an existing implementation that is based on InputStream and feed it the S3 object input stream; the concern of metadata storage still remains. Most often S3 objects are stored in compressed form, so this source would need to take care of compression (gz/Avro/others).
>>>>>>>
>>>>>>> Best is to start with something that works and then start adding more features to it.
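A sketch of Ashish's option 1 (and of Viral's spooldir extension): download with the S3 SDK into a temporary directory, then move the file into the spool directory. Written against the AWS SDK for Java 1.x; the class name and structure here are made up, and state tracking and error handling are omitted:

    import java.io.File;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class S3ToSpoolDirDownloader {
      private final AmazonS3 s3 = new AmazonS3Client(); // default credential chain
      private final File spoolDir;
      private final File tmpDir;

      public S3ToSpoolDirDownloader(File spoolDir, File tmpDir) {
        this.spoolDir = spoolDir;
        this.tmpDir = tmpDir;
      }

      /** Download one object into tmpDir, then move it atomically into the
          spool dir so SpoolDirectorySource never sees a half-written file. */
      public void fetch(String bucket, String key) throws Exception {
        File tmp = new File(tmpDir, key.replace('/', '_'));
        try (S3Object obj = s3.getObject(bucket, key);
             InputStream in = obj.getObjectContent()) {
          Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp.toPath(), new File(spoolDir, tmp.getName()).toPath(),
            StandardCopyOption.ATOMIC_MOVE);
      }

      /** One polling pass; a real source would skip keys it has already seen. */
      public void pollOnce(String bucket) throws Exception {
        for (S3ObjectSummary s : s3.listObjects(bucket).getObjectSummaries()) {
          fetch(bucket, s.getKey());
        }
      }
    }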
>>>>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I started trying to write some code on this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are:
>>>>>>>>
>>>>>>>> 1. Fetching data from S3 periodically.
>>>>>>>> 2. Fetching data from multiple S3 buckets -- this may be something that should be punted on until later. For a first implementation, it could be solved just by having multiple sources, each with a single S3 bucket.
>>>>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis, can you clarify what you mean by this?*
>>>>>>>> 4. Dynamically reconfiguring the source -- this is blocked by FLUME-1491, so I think it is out of scope for discussion at the moment.
>>>>>>>>
>>>>>>>> Some questions I want to try to answer:
>>>>>>>>
>>>>>>>> 1. How do we identify and track objects that need to be processed versus objects that have been processed already?
>>>>>>>> 1a. What if we want to have multiple sources working against the same bucket to speed up processing?
>>>>>>>> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>>>>>>
>>>>>>>> For the first question, if we ignore the multiple-source extension, I think the simplest answer is to do something on the local filesystem, like a tracking directory that contains a list of to-be-processed objects and a list of already-processed objects. However, if the source goes down, what should the restart semantics be? It seems that the ideal situation is to store this state in a system like ZooKeeper, which would ensure that a number of sources could operate off of the same bucket, but this probably requires FLUME-1491 first.
>>>>>>>>
>>>>>>>> For the second question, my feeling was that we should work with assumptions similar to the SpoolingDirectorySource's, where each line is a separate event. Does that seem reasonable?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Natty
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft) I'd be glad if you could attach it in JIRA. I'll do the same. What do you think?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Paweł Róg
>>>>>>>>>
>>>>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>>>>>>>
>>>>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>>>>
>>>>>>>>>> Jonathan Natkins wrote:
>>>>>>>>>>
>>>>>>>>>> Hey Pawel,
>>>>>>>>>>
>>>>>>>>>> My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you need urgently, it may not be ideal to wait for me to start, and it may make sense to build something for yourself.
>>>>>>>>>>
>>>>>>>>>> That said, as mentioned in the other thread, dynamic configuration can be done by refreshing the configuration files across the set of Flume agents. It's certainly not as great as having a single place to change it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Natty
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Jonathan, how should we interpret your last e-mail? You opened a JIRA issue and want to start implementing this -- do you have any estimate of how long it will take?
>>>>>>>>>>
>>>>>>>>>> I think the biggest challenge here is to have dynamic configuration of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?
>>>>>>>>>>
>>>>>>>>>> > Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>>>
>>>>>>>>>> I think we don't need to track multiple S3 buckets with a single source. I just imagine an approach where each S3 source can be added or deleted on demand and attached to any channel. I'm only worried about this dynamic configuration; I'll open a new thread about it. It seems we have two totally separate things:
>>>>>>>>>> * build an S3 source
>>>>>>>>>> * make Flume configurable dynamically
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Paweł
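For the single-agent case, the tracking directory Natty describes above could be as small as an append-only ledger of processed keys that is replayed on restart; a key is re-fetched only if the crash happened before its line was written, giving at-least-once delivery. A minimal sketch (all names hypothetical):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashSet;
    import java.util.Set;

    public class ProcessedKeyLedger {
      private final Path ledger;
      private final Set<String> processed = new HashSet<>();

      public ProcessedKeyLedger(Path ledger) throws IOException {
        this.ledger = ledger;
        if (Files.exists(ledger)) {
          // Replay the ledger on restart.
          processed.addAll(Files.readAllLines(ledger, StandardCharsets.UTF_8));
        }
      }

      public boolean isProcessed(String key) {
        return processed.contains(key);
      }

      /** Append before acknowledging, so a restart never marks lost work as done. */
      public synchronized void markProcessed(String key) throws IOException {
        Files.write(ledger, (key + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        processed.add(key);
      }
    }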
>>>>>>>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Hey all,
>>>>>>>>>> > I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>>>>
>>>>>>>>>> Thanks! Should Fix Version be set to the next Flume release version?
>>>>>>>>>>
>>>>>>>>>> > I thought I'd start working on one myself, which can hopefully be contributed back. I'm curious: do you have particular requirements? Based on the emails in this thread, it sounds like the original goal was to have something that's like a SpoolDirectorySource that just picks up new files from S3. Is that accurate?
>>>>>>>>>>
>>>>>>>>>> Yes, I think so. We need to be able to:
>>>>>>>>>> * fetch data (logs, for pulling them into Logsene <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min, every 5 min, etc.)
>>>>>>>>>> * fetch data from multiple S3 buckets
>>>>>>>>>> * associate an S3 bucket with a user/token/key
>>>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) add new S3 buckets from which data should be fetched
>>>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) stop fetching data from some S3 buckets
>>>>>>>>>>
>>>>>>>>>> > Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>>>
>>>>>>>>>> I think the above addresses this question.
>>>>>>>>>>
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Natty
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Otis
>>>>>>>>>> --
>>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 for seeing an S3Source, starting with a JIRA issue.
>>>>>>>>>>
>>>>>>>>>> But being able to dynamically add/remove the S3 buckets from which to pull data seems important.
>>>>>>>>>>
>>>>>>>>>> Any suggestions for how to approach that?
>>>>>>>>>>
>>>>>>>>>> Otis
>>>>>>>>>> --
>>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Please go ahead and file a jira. If you are willing to submit a patch, you can post it on the jira.
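Otis's requirement list above maps naturally onto per-source properties. A purely hypothetical configuration sketch, assuming one source per bucket as Paweł suggests (none of these property names exist in Flume today):

    # Hypothetical properties for the proposed source; all names are made up.
    agent.sources = s3logs
    agent.sources.s3logs.type = org.apache.flume.source.s3.S3Source
    agent.sources.s3logs.bucket = my-log-bucket
    agent.sources.s3logs.pollIntervalSeconds = 60
    agent.sources.s3logs.awsAccessKeyId = AKIAEXAMPLE
    agent.sources.s3logs.awsSecretKey = secret
    agent.sources.s3logs.channels = c1

The dynamic add/remove requirement is the one thing a static file like this cannot express, which is why the discussion keeps circling back to FLUME-1491.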
>>>>>>>>>> Viral Bajaria wrote:
>>>>>>>>>>
>>>>>>>>>> I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested.
>>>>>>>>>>
>>>>>>>>>> For now, I went down the route of writing a Python script which downloads from S3 and puts the files in a directory which is configured to be picked up via a spooldir.
>>>>>>>>>>
>>>>>>>>>> I would prefer to get a direct S3 source, and maybe we could collaborate on it and open-source it. Let me know if you prefer that and we can work directly on it by creating a JIRA.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Viral
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In both cases, Sharninder is right :)
>>>>>>>>>>
>>>>>>>>>> Sharninder wrote:
>>>>>>>>>>
>>>>>>>>>> As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a Pollable source; the dev documentation has an outline that you can use. You can also look at the existing ExecSource and work your way up.
>>>>>>>>>>
>>>>>>>>>> As far as I know, there is no way to configure Flume without using the configuration file.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm wondering if Flume is able to read directly from S3.
>>>>>>>>>>
>>>>>>>>>> I'll describe my case. I have log files stored in AWS S3. I have to periodically fetch new S3 objects and read log lines from them. Then the log lines (events) are processed in the standard Flume way (as with other sources).
>>>>>>>>>>
>>>>>>>>>> *1) Is there any way to fetch S3 objects, or do I have to write my own Source?*
>>>>>>>>>>
>>>>>>>>>> There is also a second case. I want the Flume configuration to be dynamic. Flume sources can change over time; a new AWS key and S3 bucket can be added or deleted.
>>>>>>>>>>
>>>>>>>>>> *2) Is there any other way to configure Flume than by a static configuration file?*
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Paweł Róg
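Sharninder's pointer above is the natural shape for this: a PollableSource whose process() method Flume's runner invokes repeatedly. A skeleton following that outline (the S3 plumbing and the `bucket` property name are placeholders, not a real implementation):

    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class S3Source extends AbstractSource implements Configurable, PollableSource {
      private String bucket;

      @Override
      public void configure(Context context) {
        bucket = context.getString("bucket"); // hypothetical property name
      }

      @Override
      public Status process() throws EventDeliveryException {
        try {
          // Placeholder step: list new objects, read lines, hand them to the channel.
          for (String line : fetchNewLinesFromS3(bucket)) {
            getChannelProcessor().processEvent(EventBuilder.withBody(line.getBytes()));
          }
          return Status.READY;
        } catch (Exception e) {
          return Status.BACKOFF; // ask the runner to back off and retry later
        }
      }

      private Iterable<String> fetchNewLinesFromS3(String bucket) {
        throw new UnsupportedOperationException("S3 plumbing omitted in this sketch");
      }
    }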
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal