Hello,

As I continue to ramp up on Apache Flume (v1.3.0), I have run into a few 
challenges and am hoping somebody with more experience can shed some light.  

1. Establishing a data pipeline is trivial, but I have noticed that any 
exception thrown during the channel->sink operation triggers what appears to 
be an endlessly repeating cycle of exceptions.  For example, a single bad 
event (producing a Java stack trace) puts the agent into a tailspin.  There 
are no tools for managing the pipeline: identifying the culprit events/files, 
stopping or purging the channel, introspecting the pipeline, etc.  The best 
course of action I have found is to purge everything under the file channel 
and restart the agent.  I've read several posts suggesting that regex 
interceptors could be a potential fix, but in a production environment it is 
almost impossible to predict which exceptions will occur.  In my opinion, 
there needs to be a declarative way to move bad events out of the channel to 
a "dead-letter queue" or equivalent. 
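For what it's worth, the closest thing I've found to a declarative filter is the regex filtering interceptor, which sits on the source rather than the channel.  A minimal sketch (the agent/component names a1/r1/i1 and the pattern are placeholders, and it only catches patterns you can predict up front, which is exactly the limitation above):

```properties
# Drop events whose body matches a known-bad pattern before they ever
# reach the channel; excludeEvents = true means "discard matches".
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^BAD_RECORD.*
a1.sources.r1.interceptors.i1.excludeEvents = true
```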
2. I was hoping that the Spooling Directory Source would help us capture file 
metadata, but nothing ever appears in the directory configured by the 
trackerDir option (default .flumespool).  Is that expected?
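In case it helps frame the question, this is the configuration I would expect to surface file metadata; as I understand it, fileHeader = true should at least attach the originating file path as an event header (names and paths below are placeholders):

```properties
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming
# Attach the absolute path of the source file as an event header
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
# Tracking metadata directory, relative to spoolDir by default
a1.sources.r1.trackerDir = .flumespool
```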
3. Maybe my use case is not the right fit for Flume, but my largest design 
constraint is that we deal with files; everything we do is file-based.  I was 
hoping that the spooldir source and its batch control options would provide 
an intuitive way to process files arriving in a spool directory and 
ultimately land that same data in HDFS.  However, a file with 470,000 lines 
is creating over 52MM events, and because the tooling is weak I have no 
visibility into why that many events are being created or how far along the 
agent is.  The data flow architecture is a perfect fit, but maybe Flume is 
best suited to logs, tailing files, etc., and not necessarily to processing 
whole files?
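For reference, the pipeline in question looks roughly like this (a sketch with placeholder agent/component names and paths, not my exact production config); the batchSize and roll settings are the knobs I had hoped would give some control over how files move through and how they land in HDFS:

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Files dropped into the spool directory are turned into events
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming
a1.sources.r1.batchSize = 1000
a1.sources.r1.channels = c1

# Durable file channel between source and sink
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /flume/checkpoint
a1.channels.c1.dataDirs = /flume/data

# Land the data in HDFS, rolling on size rather than event count
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.channel = c1
```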

Thanks
