Hi Hari,

Just curious about the performance improvement: can you provide the number of the JIRA that improves performance in 1.3.1?
Thanks,
Pankaj

On Wed, Aug 14, 2013 at 2:23 PM, Hari Shreedharan <[email protected]> wrote:

> Flume v1.3.0 had a major performance issue, which is why 1.3.1 was
> released immediately after. The current stable release is 1.4.0, so you
> should use that.
>
> 1. Can you detail this point? Channel to sink should really not have any
> exceptions. If the sink or a plugin the sink is using is causing
> rollbacks, then it should handle the failure cases/drop events, etc. The
> channel is pretty much a passive component, just like a queue; "bad events"
> are events sinks cannot handle for some reason. The logic for handling
> this should be in the sink itself.
>
> 2. Currently that is not an option, but if you need it, chances are there
> are others who do too. Explain your use case in a JIRA. Remember, Flume is
> not a file streaming system, it is an event streaming one, so each file is
> still converted into events by Flume.
>
> 3. If you think the current deserializers don't fit your use case, you can
> easily write your own and drop it in.
>
> Thanks,
> Hari
>
> On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:
>
> Hello,
>
> As I continue to ramp up on Apache Flume (v1.3.0), I have observed a
> few challenges, and I am hoping somebody with more experience can shed some
> light.
>
> 1. Establishing a data pipeline is trivial; what I have noticed is that
> any exceptions caught from the channel->sink operation invoke what appears
> to be a repeating cycle of exceptions. As an example, any events which
> cause an exception (Java stack trace) put the agent into a tailspin. There
> are no tools for managing the pipeline: identifying culprit events/files,
> stopping, purging the channel, introspecting the pipeline, etc. The best
> course of action is to purge everything under the file channel and restart
> the agent.
> I've read several posts suggesting that using regex interceptors
> could be a potential fix, but it is almost impossible to predict, in a
> production environment, what exceptions are going to occur. In my opinion,
> there has to be a declarative way to move bad events out of the channel
> to a "dead-letter queue" or equivalent.
>
> 2. I was hoping that the Spooling Directory Source would help us capture
> file metadata, but nothing ever appears under the default .flumespool
> trackerDir option.
>
> 3. Maybe my use case is not the right fit for Flume, but my largest design
> constraint is that we deal with files; everything we do is based on files.
> I was hoping that the spooldir and batch control options would provide an
> intuitive way to process files arriving in a spool directory, and ultimately
> land that same data in HDFS. However, a file with 470,000 lines is
> creating over 52MM events, and because the tooling is weak, I have no
> visibility into why that many events are being created, or where the agent
> is with respect to completion. The data flow architecture is perfect, but
> maybe Flume is best used for logs, tailing of files, etc., not necessarily
> processing files?
>
> Thanks

--
Pankaj Gupta | Software Engineer
*P* | (415) 677-9222 ext. 205 *F* | (415) 677-0895 | [email protected]
*BrightRoll, Inc.* | Smart Video Advertising | www.brightroll.com
United States | Canada | United Kingdom | Germany
We're hiring! <http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
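[For readers hitting the same issues, a minimal spooldir-to-HDFS agent configuration pulling together the suggestions in this thread might look like the sketch below. The agent and component names, directory paths, and the interceptor regex are all illustrative placeholders; the `deserializer` property assumes Flume 1.4.0, the release Hari recommends. Note that `regex_filter` drops matching events outright rather than diverting them to a dead-letter queue, which Flume does not provide out of the box.]

```properties
# Illustrative agent named "agent1"; adapt names, paths, and sizes.
agent1.sources  = spool-src
agent1.channels = file-ch
agent1.sinks    = hdfs-sink

# Spooling Directory Source. fileHeader attaches the origin filename to
# each event as a header, which gives some per-file metadata downstream.
agent1.sources.spool-src.type = spooldir
agent1.sources.spool-src.spoolDir = /var/flume/spool
agent1.sources.spool-src.trackerDir = .flumespool
agent1.sources.spool-src.fileHeader = true
agent1.sources.spool-src.fileHeaderKey = file
# Flume 1.4.0: pluggable deserializer; LINE (the default) emits one
# event per line, so a 470,000-line file yields 470,000 events.
agent1.sources.spool-src.deserializer = LINE
agent1.sources.spool-src.batchSize = 1000
agent1.sources.spool-src.channels = file-ch

# Regex filtering interceptor: excludeEvents = true drops events whose
# body matches the (placeholder) pattern before they enter the channel.
agent1.sources.spool-src.interceptors = filt
agent1.sources.spool-src.interceptors.filt.type = regex_filter
agent1.sources.spool-src.interceptors.filt.regex = ^BAD.*
agent1.sources.spool-src.interceptors.filt.excludeEvents = true

# Durable file channel
agent1.channels.file-ch.type = file
agent1.channels.file-ch.checkpointDir = /var/flume/checkpoint
agent1.channels.file-ch.dataDirs = /var/flume/data

# HDFS sink; %{file} substitutes the filename header set by fileHeader
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/%{file}
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = file-ch
```

[Started with `flume-ng agent --conf conf --conf-file agent1.conf --name agent1`. Since the LINE deserializer produces one event per input line, an event count far above the line count can point to the deserializer/line-length settings splitting lines further, or to repeated rollback-and-replay of the same batch after sink failures, which ties back to point 1 above.]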
