Adding the dev list to the discussion
On Wed, Aug 6, 2014 at 9:37 AM, Jonathan Natkins <na...@streamsets.com> wrote:

> Ashish, I've put some comments inline.
>
> On Tuesday, August 5, 2014, Ashish <paliwalash...@gmail.com> wrote:
>
>> Sharing some random thoughts
>>
>> 1. Download the file using the S3 SDK and let the SpoolDirectory
>> implementation take care of the rest, like a Decorator in front of
>> SpoolDirectory.
>
> This works for the simple case, but I don't think this is an ideal
> solution. My primary concern is that S3's max object size is 5TB, so
> downloading the object to local disk may not be possible.
>
>> 2. Use the S3 SDK to create an InputStream over S3 objects directly in
>> code and create events out of it.
>>
>> It would be great to reuse an existing implementation that is based on an
>> InputStream and feed it the S3 object's input stream; the concern of
>> metadata storage still remains. S3 objects are most often stored in
>> compressed form, so this source would need to handle compression
>> (gz/avro/others).
>>
>> Best to start with something that works and then add more features to it.
>>
>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>
>>> Hi all,
>>>
>>> I started trying to write some code on this, and realized there are a
>>> number of issues that need to be discussed in order to really design this
>>> feature effectively. The requirements that have been discussed thus far are:
>>>
>>> 1. Fetching data from S3 periodically
>>> 2. Fetching data from multiple S3 buckets -- This may be something that
>>> should be punted on until later. For a first implementation, this could be
>>> solved just by having multiple sources, each with a single S3 bucket
>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>> clarify what you mean by this?*
>>> 4. Dynamically reconfiguring the source -- This is blocked by FLUME-1491,
>>> so I think this is out of scope for discussions at the moment
>>>
>>> Some questions I want to try to answer:
>>>
>>> 1. How do we identify and track objects that need to be processed versus
>>> objects that have been processed already?
>>> 1a. What if we want multiple sources working against the same bucket to
>>> speed up processing?
>>> 2. Is it fair to assume that we're dealing with character files rather
>>> than binary objects?
>>>
>>> For the first question, if we ignore the multiple-source extension of
>>> the question, I think the simplest answer is to do something on the local
>>> filesystem, like have a tracking directory that contains a list of
>>> to-be-processed objects and a list of already-processed objects. However,
>>> if the source goes down, what should the restart semantics be? It seems
>>> that the ideal situation is to store this state in a system like ZooKeeper,
>>> which would ensure that a number of sources could operate off of the same
>>> bucket, but this probably requires FLUME-1491 first.
>>>
>>> For the second question, my feeling was that we should work with
>>> assumptions similar to how the SpoolingDirectorySource works, where each
>>> line is a separate event. Does that seem reasonable?
>>>
>>> Thanks,
>>> Natty
>>>
>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Thanks for the explanation, Jonathan. I think I will also start working
>>>> on it. When you have any patch (even a draft) I'd be glad if you could
>>>> attach it in JIRA. I'll do the same. What do you think?
>>>>
>>>> --
>>>> Paweł Róg
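A minimal sketch of Ashish's option 2 -- streaming an S3 object and turning each line into a Flume event -- assuming the AWS SDK for Java (AmazonS3) and Flume's EventBuilder. The S3ObjectReader class and its method are illustrative, not an existing implementation:

// A sketch of option 2: stream an S3 object and emit one Flume event per
// line, mirroring the SpoolDirectorySource convention discussed above.
// Class and method names are placeholders.
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

import org.apache.flume.Event;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.event.EventBuilder;

public class S3ObjectReader {

  /** Streams one S3 object and emits each line as a Flume event. */
  public static void streamObject(AmazonS3 s3, ChannelProcessor processor,
                                  String bucket, String key) throws Exception {
    S3Object object = s3.getObject(bucket, key);
    InputStream in = object.getObjectContent();
    // Handle the common gzip case Ashish mentions; Avro and other container
    // formats would need their own deserializers.
    if (key.endsWith(".gz")) {
      in = new GZIPInputStream(in);
    }
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        Event event = EventBuilder.withBody(line, StandardCharsets.UTF_8);
        processor.processEvent(event);
      }
    }
  }
}

Because this never touches local disk, it sidesteps the 5TB-object concern raised against option 1.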
>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>>
>>>>> +1 on an S3 Source. I would gladly review.
>>>>>
>>>>> Jonathan Natkins wrote:
>>>>>
>>>>> Hey Pawel,
>>>>>
>>>>> My intention is to start working on it, but I don't know exactly how
>>>>> long it will take, and I'm not a committer, so time estimates would
>>>>> have to be taken with a grain of salt regardless. If this is something
>>>>> you need urgently, it may not be ideal to wait for me; you may want to
>>>>> start building something yourself.
>>>>>
>>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>>> be done by refreshing the configuration files across the set of Flume
>>>>> agents. It's certainly not as great as having a single place to change
>>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> Jonathan, how should we interpret your last e-mail? You opened a JIRA
>>>>> issue and want to start implementing this -- do you have any estimate
>>>>> of how long it will take?
>>>>>
>>>>> I think the biggest challenge here is to have dynamic configuration
>>>>> of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I
>>>>> right?
>>>>>
>>>>> > Would you need to be able to pull files from multiple S3
>>>>> > directories with the same source?
>>>>>
>>>>> I think we don't need to track multiple S3 buckets with a single
>>>>> source. I just imagine an approach where each S3 source can be added
>>>>> or deleted on demand and attached to any Channel. I'm only worried
>>>>> about this dynamic configuration. I'll open a new thread about it. It
>>>>> seems we have two totally separate things:
>>>>> * build an S3 source
>>>>> * make Flume configurable dynamically
>>>>>
>>>>> --
>>>>> Paweł
>>>>>
>>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> I created a JIRA for this:
>>>>> https://issues.apache.org/jira/browse/FLUME-2437
>>>>>
>>>>> Thanks! Should Fix Version be set to the next Flume release version?
>>>>>
>>>>> I thought I'd start working on one myself, which can hopefully be
>>>>> contributed back. I'm curious: do you have particular requirements?
>>>>> Based on the emails in this thread, it sounds like the original goal
>>>>> was to have something like a SpoolDirectorySource that just picks up
>>>>> new files from S3. Is that accurate?
>>>>>
>>>>> Yes, I think so. We need to be able to:
>>>>> * fetch data (logs, for pulling them into Logsene
>>>>> <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1
>>>>> min, every 5 min, etc.)
>>>>> * fetch data from multiple S3 buckets
>>>>> * associate an S3 bucket with a user/token/key
>>>>> * dynamically (i.e. without editing/writing config files stored on
>>>>> disk) add new S3 buckets from which data should be fetched
>>>>> * dynamically (i.e. without editing/writing config files stored on
>>>>> disk) stop fetching data from some S3 buckets
>>>>>
>>>>> Would you need to be able to pull files from multiple S3 directories
>>>>> with the same source?
>>>>>
>>>>> I think the above addresses this question.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support * http://sematext.com/
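Otis's requirement list maps naturally onto source properties. A hypothetical agent configuration for such a source might look like the following; the type and every property name are illustrative, since no such source exists at this point:

# Hypothetical configuration -- the S3Source class and all property
# names below are placeholders, not an existing Flume source.
agent.sources = s3src
agent.channels = memch

agent.sources.s3src.type = org.example.flume.source.S3Source
agent.sources.s3src.bucket = my-log-bucket
agent.sources.s3src.accessKeyId = <AWS access key>
agent.sources.s3src.secretAccessKey = <AWS secret key>
agent.sources.s3src.pollIntervalSec = 60
agent.sources.s3src.channels = memch

agent.channels.memch.type = memory

Note that dynamically adding or removing buckets would still mean editing this file and reloading it across agents, which is exactly the FLUME-1491 gap discussed above.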
>>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>
>>>>> +1 for an S3Source, starting with a JIRA issue.
>>>>>
>>>>> But being able to dynamically add/remove S3 buckets from which to
>>>>> pull data seems important.
>>>>>
>>>>> Any suggestions for how to approach that?
>>>>>
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>> <hshreedha...@cloudera.com> wrote:
>>>>>
>>>>> Please go ahead and file a JIRA. If you are willing to submit a
>>>>> patch, you can post it on the JIRA.
>>>>>
>>>>> Viral Bajaria wrote:
>>>>>
>>>>> I have a similar use case that cropped up yesterday. I saw the
>>>>> archive and found that there was a recommendation to build it as
>>>>> Sharninder suggested.
>>>>>
>>>>> For now, I went down the route of writing a Python script which
>>>>> downloads from S3 and puts the files in a directory that is
>>>>> configured to be picked up via a spooldir.
>>>>>
>>>>> I would prefer to have a direct S3 source, and maybe we could
>>>>> collaborate on it and open-source it. Let me know if you prefer
>>>>> that, and we can work directly on it by creating a JIRA.
>>>>>
>>>>> Thanks,
>>>>> Viral
>>>>>
>>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>> <hshreedha...@cloudera.com> wrote:
>>>>>
>>>>> In both cases, Sharninder is right :)
>>>>>
>>>>> Sharninder wrote:
>>>>>
>>>>> As far as I know, there is no (open source) implementation of an S3
>>>>> source, so yes, you'll have to implement your own. You'll have to
>>>>> implement a PollableSource, and the dev documentation has an outline
>>>>> that you can use. You can also look at the existing ExecSource and
>>>>> work your way up.
>>>>>
>>>>> As far as I know, there is no way to configure Flume without using
>>>>> the configuration file.
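The PollableSource outline Sharninder points to reduces to a skeleton along these lines; this is a sketch only, and the class name and the "bucket" property are placeholders:

// Skeleton of a custom PollableSource, per the dev documentation outline.
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.source.AbstractSource;

public class S3Source extends AbstractSource implements Configurable, PollableSource {

  private String bucket;

  @Override
  public void configure(Context context) {
    // Read settings from the static configuration file -- per Sharninder,
    // currently the only way to configure a Flume source.
    bucket = context.getString("bucket");
  }

  @Override
  public Status process() throws EventDeliveryException {
    try {
      // List objects not yet processed, stream each into events (see the
      // line-per-event sketch earlier in the thread), record completion.
      boolean foundNewObjects = false; // placeholder for real listing logic
      return foundNewObjects ? Status.READY : Status.BACKOFF;
    } catch (Exception e) {
      throw new EventDeliveryException("S3 poll failed", e);
    }
  }
}

Flume calls process() in a loop; returning BACKOFF when nothing new is found gives the periodic-polling behavior requested earlier in the thread.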
>>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> I'm wondering if Flume is able to read directly from S3.
>>>>>
>>>>> I'll describe my case. I have log files stored in AWS S3. I have to
>>>>> periodically fetch new S3 objects and read log lines from them. Then
>>>>> the log lines (events) are processed in the standard Flume way (as
>>>>> with other sources).
>>>>>
>>>>> *1) Is there any way to fetch S3 objects, or do I have to write my
>>>>> own Source?*
>>>>>
>>>>> There is also a second case. I want to have a dynamic Flume
>>>>> configuration. Flume sources can change over time; new AWS keys and
>>>>> S3 buckets can be added or deleted.
>>>>>
>>>>> *2) Is there any other way to configure Flume than by a static
>>>>> configuration file?*
>>>>>
>>>>> --
>>>>> Paweł Róg
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
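As a footnote to Viral's interim workaround above (download to a directory watched by a spooldir source), a rough Java equivalent under the same assumptions -- placeholder bucket and paths, the SDK's default credential chain, and listing pagination omitted for brevity:

// Rough Java equivalent of Viral's workaround: copy new S3 objects into a
// directory watched by a SpoolDirectorySource. Names are placeholders.
import java.io.File;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3ToSpoolDir {
  public static void main(String[] args) {
    AmazonS3Client s3 = new AmazonS3Client(); // default credential chain
    String bucket = "my-log-bucket";          // placeholder
    File spoolDir = new File("/var/spool/flume");
    File tmpDir = new File("/var/spool/flume-tmp");

    ObjectListing listing = s3.listObjects(bucket);
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
      String name = summary.getKey().replace('/', '_');
      File target = new File(spoolDir, name);
      if (target.exists()) {
        continue; // crude "already downloaded" check
      }
      // Download to a temp location first, then rename into place.
      File tmp = new File(tmpDir, name);
      s3.getObject(new GetObjectRequest(bucket, summary.getKey()), tmp);
      tmp.renameTo(target);
    }
  }
}

Downloading to a temporary directory and renaming into the spool directory matters because the spooling directory source expects files to be complete and immutable once they appear.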