Re: AWS S3 flume source

2014-08-11 Thread Ashish
On Mon, Aug 11, 2014 at 4:04 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi, > > On Wed, Aug 6, 2014 at 5:04 AM, Ashish wrote: > >> Sharing some random thoughts >> >> 1. Download the file using S3 SDK and let the SpoolDirectory >> implementation take care of rest. Like a Decorator

Re: AWS S3 flume source

2014-08-11 Thread Otis Gospodnetic
Hi, On Wed, Aug 6, 2014 at 5:04 AM, Ashish wrote: > Sharing some random thoughts > > 1. Download the file using S3 SDK and let the SpoolDirectory > implementation take care of rest. Like a Decorator in front of > SpoolDirectory > My worry is that using SpoolDirectory requires temporary writes t

Re: AWS S3 flume source

2014-08-11 Thread Otis Gospodnetic
Hi, On Tue, Aug 5, 2014 at 10:57 PM, Jonathan Natkins wrote: > Hi all, > > I started trying to write some code on this, and realized there are a > number of issues that need to be discussed in order to really design this > feature effectively. The requirements that have been discussed thus far a

Re: AWS S3 flume source

2014-08-11 Thread Otis Gospodnetic
I was thinking the same. I think the store (DB, FS, ZK, something else) used to track state (what's been read from S3, what's been processed, etc.) would ideally be abstract/extensible. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematex

Re: AWS S3 flume source

2014-08-11 Thread Ashish
Maybe it is best not to depend on ZK directly. Create some sort of abstraction that can use ZK, a DB, or some other mechanism to share the distributed state. How about keeping the distributed state out of the picture until we have a working S3 source, and plugging the metadata piece into it later.
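
The abstraction suggested here could be sketched roughly as follows. This is only an illustration: `S3StateStore` and `InMemoryS3StateStore` are hypothetical names, not part of Flume, and the in-memory version covers only the single-agent case that the message proposes starting with.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface, not part of Flume: tracks which S3 object keys
// have already been processed, independent of where that state is stored.
interface S3StateStore {
    /** Records the key as processed; returns false if it was already recorded. */
    boolean markProcessed(String objectKey);

    /** Returns true if the key has been recorded as processed. */
    boolean isProcessed(String objectKey);
}

// Single-process implementation for a first working source. A ZooKeeper- or
// DB-backed implementation could replace it later without changing callers.
class InMemoryS3StateStore implements S3StateStore {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    @Override
    public boolean markProcessed(String objectKey) {
        // Set.add returns false when the key is already present.
        return processed.add(objectKey);
    }

    @Override
    public boolean isProcessed(String objectKey) {
        return processed.contains(objectKey);
    }
}
```

A source written against the interface would not need to know whether the state lives in memory, in ZK, or in a database, which is the point of deferring that choice.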

Re: AWS S3 flume source

2014-08-11 Thread Jonathan Natkins
Yeah, I realize that. The reason I think it should be somewhat dependent upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight requirement just to use a particular source. FLUME-1491 would make Flume generally dependent upon ZooKeeper, which is a good transition point to start u

Re: AWS S3 flume source

2014-08-10 Thread Ashish
There seems to be a bit of confusion here. FLUME-1491 deals only with the configuration part, nothing else. Even if it gets integrated, you would still need to write and expose an API to store metadata in ZK (FLUME-1491 doesn't bring that in). HTH! On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins wrote:

Re: AWS S3 flume source

2014-08-10 Thread Jonathan Natkins
Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, an

Re: AWS S3 flume source

2014-08-06 Thread Jonathan Natkins
Adding the dev list to the discussion On Wed, Aug 6, 2014 at 9:37 AM, Jonathan Natkins wrote: > Ashish, I've put some comments inline. > > > On Tuesday, August 5, 2014, Ashish wrote: > >> Sharing some random thoughts >> >> 1. Download the file using S3 SDK and let the SpoolDirectory >> impleme

Re: AWS S3 flume source

2014-08-06 Thread Jonathan Natkins
Ashish, I've put some comments inline. On Tuesday, August 5, 2014, Ashish wrote: > Sharing some random thoughts > > 1. Download the file using S3 SDK and let the SpoolDirectory > implementation take care of rest. Like a Decorator in front of > SpoolDirectory > > This works for the simple case, b

Re: AWS S3 flume source

2014-08-05 Thread Paweł Róg
Hi, I don't think it is possible to simply use SpoolDirectorySource. It may be possible to reuse some elements of SpoolDirectory, but without touching its code I don't think SpoolDirectory is a good base. At the very beginning, SpoolDirectorySource does this: File directory = new File(spoolD

Re: AWS S3 flume source

2014-08-05 Thread Viral Bajaria
I agree with the feedback provided by Ashish. I have started writing one that is similar to ExecSource, but I like the idea of doing something where spooldir takes over most of the hard work of spitting out events to sinks. Let me think more about how to structure that. Quick thinking out loud, I c

Re: AWS S3 flume source

2014-08-05 Thread Ashish
Sharing some random thoughts: 1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest, like a Decorator in front of SpoolDirectory. 2. Use the S3 SDK to create an InputStream of S3 objects directly in code and create events out of it. Would be great to reuse an exist
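
As a rough illustration of option 2, the sketch below turns any `InputStream` into one event body per line. It assumes line-delimited UTF-8 text; `LineEventBodies` is a hypothetical helper, not part of Flume. Obtaining the stream from S3 and wrapping each body in a Flume event are left out.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flume: splits a stream into per-line bodies.
final class LineEventBodies {
    private LineEventBodies() {}

    /**
     * Reads the stream as UTF-8 text and returns one byte[] body per line.
     * The caller remains responsible for closing the stream.
     */
    static List<byte[]> fromStream(InputStream in) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines()
                .map(line -> line.getBytes(StandardCharsets.UTF_8))
                .collect(Collectors.toList());
    }
}
```

This version reads the whole object into memory, so a real source would more likely emit events in batches while reading.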

Re: AWS S3 flume source

2014-08-05 Thread Jonathan Natkins
Hi all, I started trying to write some code on this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are: 1. Fetching data from S3 periodically 2. Fetching data from multiple

Re: AWS S3 flume source

2014-08-01 Thread Paweł
Hi, Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft), I'd be glad if you could attach it to the JIRA. I'll do the same. What do you think? -- Paweł Róg 2014-08-01 20:19 GMT+02:00 Hari Shreedharan : > +1 on an S3 Source. I would gladly review

Re: AWS S3 flume source

2014-08-01 Thread Hari Shreedharan
+1 on an S3 Source. I would gladly review. Jonathan Natkins wrote: Hey Pawel, My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you

Re: AWS S3 flume source

2014-08-01 Thread Jonathan Natkins
Hey Pawel, My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you need urgently, it may not be ideal to wait for me to start building so

Re: AWS S3 flume source

2014-08-01 Thread Paweł
Hi, Jonathan, how should we interpret your last e-mail? You opened a JIRA issue and want to start implementing this; do you have any estimate of how long it will take? I think the biggest challenge here is to have dynamic configuration of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I

Re: AWS S3 flume source

2014-08-01 Thread Otis Gospodnetic
Hi, On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins wrote: > Hey all, > > I created a JIRA for this: > https://issues.apache.org/jira/browse/FLUME-2437 > Thanks! Should Fix Version be set to the next Flume release version? I thought I'd start working on one myself, which can hopefully be > c

Re: AWS S3 flume source

2014-07-31 Thread Jonathan Natkins
Hey all, I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437 I thought I'd start working on one myself, which can hopefully be contributed back. I'm curious: do you have particular requirements? Based on the emails in this thread, it sounds like the original goal was to ha

Re: AWS S3 flume source

2014-07-31 Thread Otis Gospodnetic
+1 for seeing S3Source, starting with a JIRA issue. But being able to dynamically add/remove S3 buckets from which to pull data seems important. Any suggestions for how to approach that? Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://semat

Re: AWS S3 flume source

2014-07-31 Thread Hari Shreedharan
Please go ahead and file a jira. If you are willing to submit a patch, you can post it on the jira. Viral Bajaria wrote: I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested. For now, I went down t

Re: AWS S3 flume source

2014-07-31 Thread Viral Bajaria
I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested. For now, I went down the route of writing a python script which downloads from S3 and puts the files in a directory which is configured to be picked
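
The download-then-spool workaround can be sketched in Java rather than Python. The message does not say how the script avoids exposing half-written files, so the staging step below is an assumption: it writes to a separate directory first and then moves the finished file into the spool directory. `SpoolStager` is a hypothetical helper, and the S3 download itself is represented by a plain `InputStream`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Flume or the AWS SDK.
final class SpoolStager {
    private SpoolStager() {}

    /**
     * Writes the stream to a temporary file in stagingDir, then moves the
     * finished file into spoolDir so the spool directory never holds a
     * partially written file. Both directories are assumed to be on the
     * same filesystem, which the atomic move requires.
     */
    static Path stage(InputStream in, Path stagingDir, Path spoolDir, String fileName)
            throws IOException {
        Path tmp = stagingDir.resolve(fileName + ".part");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return Files.move(tmp, spoolDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

File names placed in the spool directory should be unique, for example derived from the S3 object key.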

Re: AWS S3 flume source

2014-07-31 Thread Hari Shreedharan
In both cases, Sharninder is right :) Sharninder wrote: As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a Pollable source and the dev documentation has an outline that you can use. You can also look

Re: AWS S3 flume source

2014-07-31 Thread Sharninder
As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a Pollable source and the dev documentation has an outline that you can use. You can also look at the existing ExecSource and work your way up. As far as