Seems like a bit of confusion here. FLUME-1491 only deals with the configuration part, nothing else. Even if it gets integrated, you would still need to write/expose an API to store metadata in ZooKeeper (FLUME-1491 doesn't bring that in).

HTH!

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
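For illustration, here is a minimal sketch of the kind of ZK-backed metadata API Ashish is saying would still be needed, using Apache Curator. The class name, znode layout, and the "last processed key per bucket" scheme are all hypothetical; nothing like this exists in Flume today.

    // Hypothetical sketch: ZK-backed bookkeeping for processed S3 keys.
    // Znode layout and class name are invented for illustration.
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import java.nio.charset.StandardCharsets;

    public class ZkS3MetadataStore {
      private static final String BASE = "/flume/s3source"; // hypothetical layout
      private final CuratorFramework client;

      public ZkS3MetadataStore(String zkConnect) {
        client = CuratorFrameworkFactory.newClient(zkConnect,
            new ExponentialBackoffRetry(1000, 3));
        client.start();
      }

      /** Record the last fully processed object key for a bucket. */
      public void markProcessed(String bucket, String key) throws Exception {
        String path = BASE + "/" + bucket + "/lastProcessed";
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        if (client.checkExists().forPath(path) == null) {
          client.create().creatingParentsIfNeeded().forPath(path, data);
        } else {
          client.setData().forPath(path, data);
        }
      }

      /** Return the last processed key for a bucket, or null on first run. */
      public String lastProcessed(String bucket) throws Exception {
        String path = BASE + "/" + bucket + "/lastProcessed";
        if (client.checkExists().forPath(path) == null) return null;
        return new String(client.getData().forPath(path), StandardCharsets.UTF_8);
      }

      public void close() { client.close(); }
    }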
On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <na...@streamsets.com> wrote:

Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, and would likely have to get rewritten once ZK integration happens anyway.

On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <pro...@gmail.com> wrote:

Hi,

I think that it is not possible to simply use SpoolDirectorySource. Maybe it will be possible to reuse some elements of SpoolDirectory, but without touching its code I don't think SpoolDirectory is a good base. At the very beginning, SpoolDirectorySource does this:

    File directory = new File(spoolDirectory);

ReliableSpoolingFileEventReader also instantiates the File class.

There is also a question: how does ReliableSpoolingFileEventReader store information about files that have already been processed in non-deleting mode? What happens after a Flume restart?

I agree with Jonathan that the S3 source should be able to store the last processed file, e.g. in ZooKeeper.

Another thing, Jonathan: I think you shouldn't worry about multiple buckets being handled by a single S3Source. As you wrote, multiple sources are the solution here. I thought it was already discussed, but maybe I'm wrong.

> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?

In my opinion the S3 source can by default read a file as simple text, but also take a configuration parameter with the class name of an "InputStream processor". This processor would be able to e.g. unzip, deserialize Avro, or read JSON and convert it into log events. What do you think?

--
Paweł Róg

2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.baja...@gmail.com>:

Agreed with the feedback provided by Ashish.

I have started writing one which is similar to the ExecSource, but I like the idea of doing something where spooldir takes over most of the hard work of spitting out events to sinks. Let me think more on how to structure that.

Quick thinking out loud: I could create a source which extends the spooldir source and just spins off a thread to manage moving things from S3 to the spooldir via a temporary directory.

Regarding maintaining metadata, there are two ways:
1) DB: I currently maintain it in a database because there are a lot of other tools built around it.
2) File: Just keep the info in memory and in a file, to help with crash recovery and/or high memory usage.

Thanks,
Viral

On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalash...@gmail.com> wrote:

Sharing some random thoughts:

1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest, like a decorator in front of SpoolDirectory.

2. Use the S3 SDK to create an InputStream of S3 objects directly in code and create events out of it.

It would be great to reuse an existing implementation which is based on InputStream and feed it the S3 object's input stream; the concern of metadata storage still remains. Most often, S3 objects are stored in compressed form, so this source would need to take care of compression (gz/avro/others).

Best is to start with something that works and then start adding more features to it.

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
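A minimal sketch of the pluggable "InputStream processor" Paweł suggests, which would also cover Ashish's compression concern. The interface name and the gzip implementation are invented for illustration; only EventBuilder is real Flume API.

    // Hypothetical sketch: a pluggable processor that turns one S3 object's
    // InputStream into Flume events. Interface and class names are invented.
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    public interface S3ObjectProcessor {
      List<Event> process(InputStream in) throws IOException;
    }

    /** Gzipped plain text: one event per line. */
    class GzipLineProcessor implements S3ObjectProcessor {
      @Override
      public List<Event> process(InputStream in) throws IOException {
        List<Event> events = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            events.add(EventBuilder.withBody(line, StandardCharsets.UTF_8));
          }
        }
        return events;
      }
    }

The source would instantiate the configured class by reflection and fall back to a plain line-per-event processor when none is given, mirroring how Flume's spooling source handles pluggable deserializers.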
On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com> wrote:

Hi all,

I started trying to write some code for this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are:

1. Fetching data from S3 periodically.
2. Fetching data from multiple S3 buckets -- this may be something that should be punted on until later. For a first implementation, this could be solved just by having multiple sources, each with a single S3 bucket.
3. Associating an S3 bucket with a user/token/key -- *Otis - can you clarify what you mean by this?*
4. Dynamically reconfiguring the source -- this is blocked by FLUME-1491, so I think it is out of scope for discussion at the moment.

Some questions I want to try to answer:

1. How do we identify and track objects that need to be processed versus objects that have been processed already?
1a. What if we want to have multiple sources working against the same bucket to speed up processing?
2. Is it fair to assume that we're dealing with character files, rather than binary objects?

For the first question, if we ignore the multiple-source extension of the question, I think the simplest answer is to do something on the local filesystem, like having a tracking directory that contains a list of to-be-processed objects and a list of already-processed objects. However, if the source goes down, what should the restart semantics be? It seems that the ideal situation is to store this state in a system like ZooKeeper, which would ensure that a number of sources could operate off of the same bucket, but this probably requires FLUME-1491 first.

For the second question, my feeling was that we should work with assumptions similar to how the SpoolingDirectorySource works, where each line is a separate event. Does that seem reasonable?

Thanks,
Natty
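One way Jonathan's tracking-directory idea could look, as a minimal sketch: a single source per bucket, no ZooKeeper, and restart semantics of "re-read the processed list from disk". File names and layout are invented for illustration.

    // Hypothetical sketch: local-filesystem tracking of processed S3 keys.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashSet;
    import java.util.Set;

    public class LocalTracker {
      private final Path processedFile;
      private final Set<String> processed = new HashSet<>();

      public LocalTracker(Path trackingDir) throws IOException {
        Files.createDirectories(trackingDir);
        processedFile = trackingDir.resolve("processed.list");
        if (Files.exists(processedFile)) {
          // Restart semantics: anything listed here is skipped on the next poll.
          processed.addAll(Files.readAllLines(processedFile, StandardCharsets.UTF_8));
        }
      }

      public boolean isProcessed(String s3Key) {
        return processed.contains(s3Key);
      }

      /** Append-only log, so a crash loses at most the in-flight object. */
      public void markProcessed(String s3Key) throws IOException {
        processed.add(s3Key);
        Files.write(processedFile, (s3Key + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }
    }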
On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pro...@gmail.com> wrote:

Hi,

Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft) I'd be glad if you could attach it to the JIRA. I'll do the same. What do you think?

--
Paweł Róg

2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedha...@cloudera.com>:

+1 on an S3 Source. I would gladly review.

Jonathan Natkins wrote:

Hey Pawel,

My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you need urgently, it may not be ideal to wait for me, and you may want to start building something for yourself.

That said, as mentioned in the other thread, dynamic configuration can be done by refreshing the configuration files across the set of Flume agents. It's certainly not as great as having a single place to change it (e.g. ZooKeeper), but it's a way to get the job done.

Thanks,
Natty

On Fri, Aug 1, 2014 at 1:33 AM, Paweł <pro...@gmail.com> wrote:

Hi,

Jonathan, how should we interpret your last e-mail? You opened a JIRA issue and want to start implementing this; do you have any estimate of how long it will take?

I think the biggest challenge here is to have dynamic configuration of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?

> Would you need to be able to pull files from multiple S3 directories with the same source?

I think we don't need to track multiple S3 buckets with a single source. I just imagine an approach where each S3 source can be added or deleted on demand and attached to any channel. I'm only afraid about this dynamic configuration. I'll open a new thread about it. It seems we have two totally separate things:
* build an S3 source
* make Flume configurable dynamically

--
Paweł

2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodne...@gmail.com>:

Hi,

On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com> wrote:

> Hey all,
>
> I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437

Thanks! Should Fix Version be set to the next Flume release version?

> I thought I'd start working on one myself, which can hopefully be contributed back. I'm curious: do you have particular requirements? Based on the emails in this thread, it sounds like the original goal was to have something that's like a SpoolDirectorySource that just picks up new files from S3. Is that accurate?

Yes, I think so. We need to be able to:
* fetch data (logs for pulling them into Logsene <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min, every 5 min, etc.)
* fetch data from multiple S3 buckets
* associate an S3 bucket with a user/token/key
* dynamically (i.e. without editing/writing config files stored on disk) add new S3 buckets from which data should be fetched
* dynamically (i.e. without editing/writing config files stored on disk) stop fetching data from some S3 buckets

> Would you need to be able to pull files from multiple S3 directories with the same source?

I think the above addresses this question.

> Thanks,
> Natty

Thanks!

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

+1 for seeing an S3Source, starting with a JIRA issue.

But being able to dynamically add/remove S3 buckets from which to pull data seems important. Any suggestions for how to approach that?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
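To make Otis's requirement list above concrete, a purely illustrative flume.conf fragment for the source being discussed. The source type and every property name below are hypothetical; no such source exists yet. Per the thread's consensus, multiple buckets would mean one source per bucket, and the dynamic add/remove requirement is what FLUME-1491 would have to solve.

    # Illustrative only -- type and property names are invented.
    agent.sources = s3src
    agent.sources.s3src.type = org.example.flume.source.S3Source
    agent.sources.s3src.bucket = my-log-bucket
    agent.sources.s3src.accessKeyId = AKIA...
    agent.sources.s3src.secretKey = ...
    # seconds between S3 list calls
    agent.sources.s3src.pollInterval = 60
    agent.sources.s3src.channels = ch1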
On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:

Please go ahead and file a JIRA. If you are willing to submit a patch, you can post it on the JIRA.

Viral Bajaria wrote:

I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested.

For now, I went down the route of writing a Python script which downloads from S3 and puts the files in a directory which is configured to be picked up via a spooldir.

I would prefer to get a direct S3 source, and maybe we could collaborate on it and open-source it. Let me know if you prefer that and we can work directly on it by creating a JIRA.

Thanks,
Viral

On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:

In both cases, Sharninder is right :)

Sharninder wrote:

As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a pollable source, and the dev documentation has an outline that you can use. You can also look at the existing ExecSource and work your way up.

As far as I know, there is no way to configure Flume without using the configuration file.
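A skeleton of the pollable-source approach Sharninder describes, wired up the way sources like ExecSource are. The S3 listing/download calls are stubbed out and the class name is hypothetical; treat this as a shape, not a working source.

    // Hypothetical skeleton of an S3 source built on Flume's PollableSource.
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.List;

    public class S3Source extends AbstractSource
        implements Configurable, PollableSource {

      private String bucket;

      @Override
      public void configure(Context context) {
        bucket = context.getString("bucket"); // hypothetical property
      }

      @Override
      public Status process() throws EventDeliveryException {
        try {
          List<String> newKeys = listNewObjects(bucket); // stub
          if (newKeys.isEmpty()) {
            return Status.BACKOFF; // nothing new; Flume backs off the poll
          }
          for (String key : newKeys) {
            for (String line : downloadLines(bucket, key)) { // stub
              Event event = EventBuilder.withBody(line, StandardCharsets.UTF_8);
              getChannelProcessor().processEvent(event);
            }
            // record `key` as processed here (file, DB, or ZK, as discussed above)
          }
          return Status.READY;
        } catch (Exception e) {
          throw new EventDeliveryException("S3 poll failed", e);
        }
      }

      // Stubs standing in for S3 SDK list/get calls.
      private List<String> listNewObjects(String bucket) { return Collections.emptyList(); }
      private List<String> downloadLines(String bucket, String key) { return Collections.emptyList(); }
    }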
On Thu, Jul 31, 2014 at 7:57 PM, Paweł <pro...@gmail.com> wrote:

Hi,

I'm wondering if Flume is able to read directly from S3.

I'll describe my case. I have log files stored in AWS S3. I have to periodically fetch new S3 objects and read log lines from them. Then the log lines (events) are processed in the standard Flume way (as with other sources).

*1) Is there any way to fetch S3 objects, or do I have to write my own Source?*

There is also a second case. I want the Flume configuration to be dynamic. Flume sources can change over time. New AWS keys and S3 buckets can be added or deleted.

*2) Is there any other way to configure Flume than by a static configuration file?*

--
Paweł Róg