Ashish, I've put some comments inline.

On Tuesday, August 5, 2014, Ashish <paliwalash...@gmail.com> wrote:
> Sharing some random thoughts
>
> 1. Download the file using S3 SDK and let the SpoolDirectory
> implementation take care of the rest. Like a Decorator in front of
> SpoolDirectory
>

This works for the simple case, but I don't think it is an ideal solution. My primary concern is that S3's maximum object size is 5 TB, so downloading the object to local disk may not be possible.

> 2. Use S3 SDK to create InputStream of S3 objects directly in code and
> create events out of it.
>
> Would be great to reuse an existing implementation which is based on
> InputStream and feed it with S3 object input stream, concern of metadata
> storage still remains. Most often S3 objects are stored in compressed form,
> so this source would need to take care of compression gz/avro/others.
>
> Best is to start with something that works and then start adding more
> features to it.
>
>
> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Hi all,
>>
>> I started trying to write some code on this, and realized there are a
>> number of issues that need to be discussed in order to really design this
>> feature effectively. The requirements that have been discussed thus far are:
>>
>> 1. Fetching data from S3 periodically
>> 2. Fetching data from multiple S3 buckets -- This may be something that
>> should be punted on until later. For a first implementation, this could be
>> solved just by having multiple sources, each with a single S3 bucket
>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>> clarify what you mean by this?*
>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>> so I think this is out-of-scope for discussions at the moment
>>
>> Some questions I want to try to answer:
>>
>> 1. How do we identify and track objects that need to be processed versus
>> objects that have been processed already?
>> 1a. What about if we want to have multiple sources working against the
>> same bucket to speed processing?
>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?
>>
>> For the first question, if we ignore the multiple source extension of
>> the question, I think the simplest answer is to do something on the local
>> filesystem, like have a tracking directory that contains a list of
>> to-be-processed objects and a list of already-processed objects. However,
>> if the source goes down, what should the restart semantics be? It seems
>> that the ideal situation is to store this state in a system like ZooKeeper,
>> which would ensure that a number of sources could operate off of the same
>> bucket, but this probably requires FLUME-1491 first.
>>
>> For the second question, my feeling was just that we should work with
>> similar assumptions to how the SpoolingDirectorySource works, where each
>> line is a separate event. Does that seem reasonable?
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>
>>> Hi,
>>> Thanks for the explanation, Jonathan. I think I will also start working on
>>> it. When you have any patch (even a draft) I'd be glad if you can attach it
>>> in JIRA. I'll do the same.
>>> What do you think?
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>
>>> +1 on an S3 Source. I would gladly review.
>>>>
>>>> Jonathan Natkins wrote:
>>>>
>>>> Hey Pawel,
>>>>
>>>> My intention is to start working on it, but I don't know exactly how
>>>> long it will take, and I'm not a committer, so time estimates would
>>>> have to be taken with a grain of salt regardless. If this is something
>>>> that you need urgently, it may not be ideal to wait for me to start
>>>> building something for yourself.
>>>>
>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>> be done by refreshing the configuration files across the set of Flume
>>>> agents. It's certainly not as great as having a single place to change
>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>> Jonathan, how should we interpret your last e-mail? You opened a
>>>> JIRA issue and want to start implementing this? Do you have any
>>>> estimate of how long it will take?
>>>>
>>>> I think the biggest challenge here is to have dynamic
>>>> configuration of Flume. It doesn't seem to be part of the FLUME-2437
>>>> issue. Am I right?
>>>>
>>>> > Would you need to be able to pull files from multiple S3
>>>> directories with the same source?
>>>>
>>>> I think we don't need to track multiple S3 buckets with a single
>>>> source. I just imagine an approach where each S3 source can be
>>>> added or deleted on demand and attached to any Channel. I'm only
>>>> worried about this dynamic configuration. I'll open a new thread
>>>> about this. It seems we have two totally separate things:
>>>> * build S3 source
>>>> * make Flume configurable dynamically
>>>>
>>>> --
>>>> Paweł
>>>>
>>>>
>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>> <na...@streamsets.com> wrote:
>>>>
>>>> Hey all,
>>>>
>>>> I created a JIRA for this:
>>>> https://issues.apache.org/jira/browse/FLUME-2437
>>>>
>>>>
>>>> Thanks! Should Fix Version be set to the next Flume release
>>>> version?
>>>>
>>>> I thought I'd start working on one myself, which can
>>>> hopefully be contributed back. I'm curious: do you have
>>>> particular requirements? Based on the emails in this
>>>> thread, it sounds like the original goal was to have
>>>> something that's like a SpoolDirectorySource that just
>>>> picks up new files from S3. Is that accurate?
>>>>
>>>>
>>>> Yes, I think so. We need to be able to:
>>>> * fetch data (logs for pulling them in Logsene
>>>> <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>> every 1 min, every 5 min, etc.)
>>>> * fetch data from multiple S3 buckets
>>>> * associate an S3 bucket with a user/token/key
>>>> * dynamically (i.e. without editing/writing config files
>>>> stored on disk) add new S3 buckets from which data should be
>>>> fetched
>>>> * dynamically (i.e. without editing/writing config files
>>>> stored on disk) stop fetching data from some S3 buckets
>>>>
>>>>
>>>> Would you need to be able to pull files from multiple S3
>>>> directories with the same source?
>>>>
>>>>
>>>> I think the above addresses this question.
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>> <otis.gospodne...@gmail.com> wrote:
>>>>
>>>> +1 for seeing S3Source, starting with a JIRA issue.
>>>>
>>>> But being able to dynamically add/remove S3 buckets
>>>> from which to pull data seems important.
>>>>
>>>> Any suggestions for how to approach that?
>>>>
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>> <hshreedha...@cloudera.com> wrote:
>>>>
>>>> Please go ahead and file a jira. If you are
>>>> willing to submit a patch, you can post it on the
>>>> jira.
>>>>
>>>> Viral Bajaria wrote:
>>>>
>>>> I have a similar use case that cropped up
>>>> yesterday. I saw the archive
>>>> and found that there was a recommendation to
>>>> build it as Sharninder
>>>> suggested.
>>>>
>>>> For now, I went down the route of writing a
>>>> python script which
>>>> downloads from S3 and puts the files in a
>>>> directory which is
>>>> configured to be picked up via a spooldir.
>>>>
>>>> I would prefer to get a direct S3 source, and
>>>> maybe we could
>>>> collaborate on it and open-source it. Let me know
>>>> if you prefer that
>>>> and we can work directly on it by creating a JIRA.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>> <hshreedha...@cloudera.com> wrote:
>>>>
>>>> In both cases, Sharninder is right :)
>>>>
>>>> Sharninder wrote:
>>>>
>>>> As far as I know, there is no (open source)
>>>> implementation of an S3
>>>> source, so yes, you'll have to implement
>>>> your own. You'll have to
>>>> implement a Pollable source and the dev
>>>> documentation has an outline
>>>> that you can use. You can also look at the
>>>> existing ExecSource and
>>>> work your way up.
>>>>
>>>> As far as I know, there is no way to
>>>> configure Flume without
>>>> using the
>>>> configuration file.
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>> <pro...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>> I'm wondering if Flume is able to read
>>>> directly from S3.
>>>>
>>>> I'll describe my case. I have log files
>>>> stored in AWS S3. I have
>>>> to periodically fetch new S3 objects and
>>>> read log lines from them.
>>>> Then the log lines (events) are
>>>> processed in Flume's standard way
>>>> (as with other sources).
>>>>
>>>> *1) Is there any way to fetch S3 objects
>>>> or do I have to write
>>>> my own
>>>> Source?*
>>>>
>>>>
>>>> There is also a second case. I want to
>>>> have Flume's configuration be
>>>> dynamic. Flume sources can change over
>>>> time. A new AWS key and S3
>>>> bucket can be added or deleted.
>>>>
>>>> *2) Is there any other way to configure
>>>> Flume than by a static
>>>> configuration file?*
>>>>
>>>> --
>>>> Paweł Róg
>>>>
>>>
>>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
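
To make option 2 a bit more concrete, here is a rough sketch of the "read from an InputStream, one line per event" idea, including the gzip case mentioned above. It is only an illustration under some assumptions: the class and method names (`StreamingLineReader`, `readLines`, `gzip`) are made up for this example, the input is assumed to be UTF-8 text with one event per line, and the `InputStream` is assumed to be whatever the S3 SDK hands back for the object content. To keep it runnable without AWS credentials, the example feeds it in-memory bytes instead. It does not address the tracking/metadata question at all.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Flume or the AWS SDK.
public class StreamingLineReader {

    // Reads the stream line by line and hands each line to the sink,
    // so the whole object never has to be stored on local disk.
    // Returns the number of lines read. Closes the stream when done.
    public static long readLines(InputStream in, boolean gzipped,
                                 Consumer<String> sink) throws IOException {
        // Assumption: compression is known up front (e.g. from the key suffix).
        InputStream raw = gzipped ? new GZIPInputStream(in) : in;
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sink.accept(line); // in a real source this would become an event
                count++;
            }
        }
        return count;
    }

    // Test helper: gzip a string in memory to stand in for a compressed object.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the object content stream returned by the S3 SDK.
        InputStream fakeObject =
                new ByteArrayInputStream(gzip("line one\nline two\n"));
        long n = readLines(fakeObject, true,
                line -> System.out.println("event: " + line));
        System.out.println("lines read: " + n);
    }
}
```

Since the reader only depends on `InputStream`, the same code path could in principle be shared with anything else that is stream-based, which is the reuse Ashish was getting at.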