I was thinking the same. I think the store (DB, FS, ZK, something else) used to track state (what's been read from S3, what's been processed, etc.) would ideally be abstract/extensible.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
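A minimal sketch of the abstraction Otis describes (all names here are hypothetical, not existing Flume API); a file-, DB-, or ZooKeeper-backed implementation would sit behind the same interface:

    // Hypothetical pluggable store for per-object ingest state.
    public interface S3StateStore {
      /** True if the object has already been fully delivered to the channel. */
      boolean isProcessed(String bucket, String key);

      /** Record that the object has been fully delivered. */
      void markProcessed(String bucket, String key);

      /** Save a resume point (e.g. a byte offset) for a partially read object. */
      void saveOffset(String bucket, String key, long offset);

      /** Load the saved resume point, or 0 if none. */
      long loadOffset(String bucket, String key);
    }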
On Mon, Aug 11, 2014 at 9:33 AM, Ashish <paliwalash...@gmail.com> wrote:

> Maybe it's best not to depend on ZK directly: create some sort of abstraction which can use ZK, a DB, or some other mechanism to share the distributed state. How about keeping the distributed state out of the picture until we have a working S3 source, and plugging the metadata piece into it later? It can store the state locally, like the SpoolingDirectory source does.
>
> wdyt?
>
> On Mon, Aug 11, 2014 at 12:53 PM, Jonathan Natkins <na...@streamsets.com> wrote:
>
>> Yeah, I realize that. The reason I think it should be somewhat dependent upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight requirement just to use a particular source. FLUME-1491 would make Flume generally dependent upon ZooKeeper, which is a good transition point to start using ZK for other state that would be necessary for Flume components. Would you agree?
>>
>> On Sun, Aug 10, 2014 at 11:35 PM, Ashish <paliwalash...@gmail.com> wrote:
>>
>>> Seems like a bit of confusion here. FLUME-1491 only deals with the configuration part, nothing else. Even if it gets integrated, you would still need to write/expose an API to store metadata info in ZK (FLUME-1491 doesn't bring that in).
>>>
>>> HTH!
>>>
>>> On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>
>>>> Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, and would likely have to get rewritten once ZK integration happens anyway.
>>>>
>>>> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <pro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I don't think it is possible to simply use SpoolDirectorySource. It may be possible to reuse some elements of SpoolDirectory, but without touching its code I don't think SpoolDirectory is a good base. At the very beginning, SpoolDirectorySource does this:
>>>>>
>>>>> File directory = new File(spoolDirectory);
>>>>>
>>>>> ReliableSpoolingFileEventReader also instantiates the File class. There is also a question: how does ReliableSpoolingFileEventReader store information about files that have already been processed in non-deleting mode? What happens after a Flume restart?
>>>>>
>>>>> I agree with Jonathan that the S3 source should be able to store the last processed file, e.g. in ZooKeeper. Another thing, Jonathan: I think you shouldn't worry about multiple buckets being handled by a single S3Source. As you wrote, multiple sources is the solution here. I thought it was already discussed, but maybe I'm wrong.
>>>>>
>>>>> >> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>>>
>>>>> In my opinion the S3 source can by default read a file as simple text, but also take a configuration parameter with the class name of an "InputStream processor". This processor would be able to, e.g., unzip, deserialize Avro, or read JSON and convert it into log events. What do you think?
>>>>>
>>>>> --
>>>>> Paweł Róg
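A rough sketch of the pluggable "InputStream processor" Paweł proposes (the interface name and signature are assumptions, not existing Flume API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;
    import org.apache.flume.Event;

    // Hypothetical plugin, loaded by class name from the source configuration.
    // Implementations turn a raw S3 object stream into Flume events.
    public interface S3ObjectProcessor {
      List<Event> process(InputStream in) throws IOException;
    }

A plain-text implementation would split the stream on newlines; a gzip implementation would first wrap `in` in a java.util.zip.GZIPInputStream; an Avro implementation would hand it to a DataFileStream.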
>>>>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.baja...@gmail.com>:
>>>>>
>>>>>> I agree with the feedback provided by Ashish.
>>>>>>
>>>>>> I have started writing one which is similar to ExecSource, but I like the idea of doing something where spooldir takes over most of the hard work of spitting out events to sinks. Let me think more on how to structure that.
>>>>>>
>>>>>> Quick thinking out loud: I could create a source which extends the spooldir source and just spins off a thread to manage moving things from S3 to the spool directory via a temporary directory.
>>>>>>
>>>>>> Regarding maintaining metadata, there are two ways:
>>>>>> 1) DB: I currently maintain it in a database because there are a lot of other tools built around it.
>>>>>> 2) File: Just keep the info in memory and in a file, to help with crash recovery and/or high memory usage.
>>>>>>
>>>>>> Thanks,
>>>>>> Viral
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalash...@gmail.com> wrote:
>>>>>>
>>>>>>> Sharing some random thoughts:
>>>>>>>
>>>>>>> 1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest -- like a decorator in front of SpoolDirectory.
>>>>>>>
>>>>>>> 2. Use the S3 SDK to create an InputStream of S3 objects directly in code and create events out of it.
>>>>>>>
>>>>>>> It would be great to reuse an existing implementation that is based on InputStream and feed it the S3 object input stream; the concern of metadata storage still remains. Most often S3 objects are stored in compressed form, so this source would need to take care of compression (gz/Avro/others).
>>>>>>>
>>>>>>> Best is to start with something that works and then start adding more features to it.
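A sketch of Ashish's option 1 (and of Viral's spooldir extension): download with the S3 SDK into a temporary directory, then move the file into the spool directory. Written against the AWS SDK for Java 1.x; the class name and structure here are made up, and state tracking and error handling are omitted:

    import java.io.File;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class S3ToSpoolDirDownloader {
      private final AmazonS3 s3 = new AmazonS3Client(); // default credential chain
      private final File spoolDir;
      private final File tmpDir;

      public S3ToSpoolDirDownloader(File spoolDir, File tmpDir) {
        this.spoolDir = spoolDir;
        this.tmpDir = tmpDir;
      }

      /** Download one object into tmpDir, then move it atomically into the
          spool dir so SpoolDirectorySource never sees a half-written file. */
      public void fetch(String bucket, String key) throws Exception {
        File tmp = new File(tmpDir, key.replace('/', '_'));
        try (S3Object obj = s3.getObject(bucket, key);
             InputStream in = obj.getObjectContent()) {
          Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp.toPath(), new File(spoolDir, tmp.getName()).toPath(),
            StandardCopyOption.ATOMIC_MOVE);
      }

      /** One polling pass; a real source would skip keys it has already seen. */
      public void pollOnce(String bucket) throws Exception {
        for (S3ObjectSummary s : s3.listObjects(bucket).getObjectSummaries()) {
          fetch(bucket, s.getKey());
        }
      }
    }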
>>>>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I started trying to write some code on this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are:
>>>>>>>>
>>>>>>>> 1. Fetching data from S3 periodically.
>>>>>>>> 2. Fetching data from multiple S3 buckets -- this may be something that should be punted on until later. For a first implementation, it could be solved just by having multiple sources, each with a single S3 bucket.
>>>>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis, can you clarify what you mean by this?*
>>>>>>>> 4. Dynamically reconfiguring the source -- this is blocked by FLUME-1491, so I think it is out of scope for discussion at the moment.
>>>>>>>>
>>>>>>>> Some questions I want to try to answer:
>>>>>>>>
>>>>>>>> 1. How do we identify and track objects that need to be processed versus objects that have been processed already?
>>>>>>>> 1a. What if we want to have multiple sources working against the same bucket to speed up processing?
>>>>>>>> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>>>>>>
>>>>>>>> For the first question, if we ignore the multiple-source extension, I think the simplest answer is to do something on the local filesystem, like a tracking directory that contains a list of to-be-processed objects and a list of already-processed objects. However, if the source goes down, what should the restart semantics be? It seems that the ideal situation is to store this state in a system like ZooKeeper, which would ensure that a number of sources could operate off of the same bucket, but this probably requires FLUME-1491 first.
>>>>>>>>
>>>>>>>> For the second question, my feeling was that we should work with assumptions similar to the SpoolingDirectorySource's, where each line is a separate event. Does that seem reasonable?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Natty
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft) I'd be glad if you could attach it in JIRA. I'll do the same. What do you think?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Paweł Róg
>>>>>>>>>
>>>>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:
>>>>>>>>>
>>>>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>>>>
>>>>>>>>>> Jonathan Natkins wrote:
>>>>>>>>>>
>>>>>>>>>> Hey Pawel,
>>>>>>>>>>
>>>>>>>>>> My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you need urgently, it may not be ideal to wait for me to start, and it may make sense to build something for yourself.
>>>>>>>>>>
>>>>>>>>>> That said, as mentioned in the other thread, dynamic configuration can be done by refreshing the configuration files across the set of Flume agents. It's certainly not as great as having a single place to change it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Natty
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Jonathan, how should we interpret your last e-mail? You opened a JIRA issue and want to start implementing this -- do you have any estimate of how long it will take?
>>>>>>>>>>
>>>>>>>>>> I think the biggest challenge here is to have dynamic configuration of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?
>>>>>>>>>>
>>>>>>>>>> > Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>>>
>>>>>>>>>> I think we don't need to track multiple S3 buckets with a single source. I just imagine an approach where each S3 source can be added or deleted on demand and attached to any channel. I'm only worried about this dynamic configuration; I'll open a new thread about it. It seems we have two totally separate things:
>>>>>>>>>> * build an S3 source
>>>>>>>>>> * make Flume configurable dynamically
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Paweł
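For the single-agent case, the tracking directory Natty describes above could be as small as an append-only ledger of processed keys that is replayed on restart; a key is re-fetched only if the crash happened before its line was written, giving at-least-once delivery. A minimal sketch (all names hypothetical):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashSet;
    import java.util.Set;

    public class ProcessedKeyLedger {
      private final Path ledger;
      private final Set<String> processed = new HashSet<>();

      public ProcessedKeyLedger(Path ledger) throws IOException {
        this.ledger = ledger;
        if (Files.exists(ledger)) {
          // Replay the ledger on restart.
          processed.addAll(Files.readAllLines(ledger, StandardCharsets.UTF_8));
        }
      }

      public boolean isProcessed(String key) {
        return processed.contains(key);
      }

      /** Append before acknowledging, so a restart never marks lost work as done. */
      public synchronized void markProcessed(String key) throws IOException {
        Files.write(ledger, (key + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        processed.add(key);
      }
    }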
>>>>>>>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Hey all,
>>>>>>>>>> > I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>>>>
>>>>>>>>>> Thanks! Should Fix Version be set to the next Flume release version?
>>>>>>>>>>
>>>>>>>>>> > I thought I'd start working on one myself, which can hopefully be contributed back. I'm curious: do you have particular requirements? Based on the emails in this thread, it sounds like the original goal was to have something that's like a SpoolDirectorySource that just picks up new files from S3. Is that accurate?
>>>>>>>>>>
>>>>>>>>>> Yes, I think so. We need to be able to:
>>>>>>>>>> * fetch data (logs, for pulling them into Logsene <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min, every 5 min, etc.)
>>>>>>>>>> * fetch data from multiple S3 buckets
>>>>>>>>>> * associate an S3 bucket with a user/token/key
>>>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) add new S3 buckets from which data should be fetched
>>>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) stop fetching data from some S3 buckets
>>>>>>>>>>
>>>>>>>>>> > Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>>>
>>>>>>>>>> I think the above addresses this question.
>>>>>>>>>>
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Natty
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Otis
>>>>>>>>>> --
>>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 for seeing an S3Source, starting with a JIRA issue.
>>>>>>>>>>
>>>>>>>>>> But being able to dynamically add/remove the S3 buckets from which to pull data seems important.
>>>>>>>>>>
>>>>>>>>>> Any suggestions for how to approach that?
>>>>>>>>>>
>>>>>>>>>> Otis
>>>>>>>>>> --
>>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Please go ahead and file a jira. If you are willing to submit a patch, you can post it on the jira.
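Otis's requirement list above maps naturally onto per-source properties. A purely hypothetical configuration sketch, assuming one source per bucket as Paweł suggests (none of these property names exist in Flume today):

    # Hypothetical properties for the proposed source; all names are made up.
    agent.sources = s3logs
    agent.sources.s3logs.type = org.apache.flume.source.s3.S3Source
    agent.sources.s3logs.bucket = my-log-bucket
    agent.sources.s3logs.pollIntervalSeconds = 60
    agent.sources.s3logs.awsAccessKeyId = AKIAEXAMPLE
    agent.sources.s3logs.awsSecretKey = secret
    agent.sources.s3logs.channels = c1

The dynamic add/remove requirement is the one thing a static file like this cannot express, which is why the discussion keeps circling back to FLUME-1491.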
>>>>>>>>>> Viral Bajaria wrote:
>>>>>>>>>>
>>>>>>>>>> I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested.
>>>>>>>>>>
>>>>>>>>>> For now, I went down the route of writing a Python script which downloads from S3 and puts the files in a directory which is configured to be picked up via a spooldir.
>>>>>>>>>>
>>>>>>>>>> I would prefer to get a direct S3 source, and maybe we could collaborate on it and open-source it. Let me know if you prefer that and we can work directly on it by creating a JIRA.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Viral
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>> In both cases, Sharninder is right :)
>>>>>>>>>>
>>>>>>>>>> Sharninder wrote:
>>>>>>>>>>
>>>>>>>>>> As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a Pollable source; the dev documentation has an outline that you can use. You can also look at the existing ExecSource and work your way up.
>>>>>>>>>>
>>>>>>>>>> As far as I know, there is no way to configure Flume without using the configuration file.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm wondering if Flume is able to read directly from S3.
>>>>>>>>>>
>>>>>>>>>> I'll describe my case. I have log files stored in AWS S3. I have to periodically fetch new S3 objects and read log lines from them. Then the log lines (events) are processed in the standard Flume way (as with other sources).
>>>>>>>>>>
>>>>>>>>>> *1) Is there any way to fetch S3 objects, or do I have to write my own Source?*
>>>>>>>>>>
>>>>>>>>>> There is also a second case. I want the Flume configuration to be dynamic. Flume sources can change over time; a new AWS key and S3 bucket can be added or deleted.
>>>>>>>>>>
>>>>>>>>>> *2) Is there any other way to configure Flume than by a static configuration file?*
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Paweł Róg
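Sharninder's pointer above is the natural shape for this: a PollableSource whose process() method Flume's runner invokes repeatedly. A skeleton following that outline (the S3 plumbing and the `bucket` property name are placeholders, not a real implementation):

    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class S3Source extends AbstractSource implements Configurable, PollableSource {
      private String bucket;

      @Override
      public void configure(Context context) {
        bucket = context.getString("bucket"); // hypothetical property name
      }

      @Override
      public Status process() throws EventDeliveryException {
        try {
          // Placeholder step: list new objects, read lines, hand them to the channel.
          for (String line : fetchNewLinesFromS3(bucket)) {
            getChannelProcessor().processEvent(EventBuilder.withBody(line.getBytes()));
          }
          return Status.READY;
        } catch (Exception e) {
          return Status.BACKOFF; // ask the runner to back off and retry later
        }
      }

      private Iterable<String> fetchNewLinesFromS3(String bucket) {
        throw new UnsupportedOperationException("S3 plumbing omitted in this sketch");
      }
    }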
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal