Hi Ed!

This is definitely doable. What you want is an interceptor on source A that will do the conversion from log lines to Avro. The easiest way to do this would probably be to use Morphlines[1] from the Kite SDK[2]. This will let you put all of your transformation logic into a configuration file that's easy to change at runtime.
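To make that concrete, the source-A side of the agent could look something like the sketch below. The agent/component names, the paths, and the schema header key are placeholders I made up for illustration, so double-check them against the docs; the individual pieces are described below.

  agentA.sources = logsrc
  agentA.channels = ch1
  agentA.sinks = k1

  # Read completed log files dropped into a spooling directory
  agentA.sources.logsrc.type = spooldir
  agentA.sources.logsrc.spoolDir = /var/log/myapp/spool
  agentA.sources.logsrc.channels = ch1

  # Interceptor chain: the morphline converts each log line into an Avro
  # record, then a static interceptor stamps the schema onto the header
  agentA.sources.logsrc.interceptors = morphline schema
  agentA.sources.logsrc.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
  agentA.sources.logsrc.interceptors.morphline.morphlineFile = /etc/flume/conf/morphline.conf
  agentA.sources.logsrc.interceptors.morphline.morphlineId = convertLogToAvro

  agentA.sources.logsrc.interceptors.schema.type = static
  # header key assumed here; check what your serializer/sink expects
  agentA.sources.logsrc.interceptors.schema.key = flume.avro.schema.url
  agentA.sources.logsrc.interceptors.schema.value = hdfs://namenode/schemas/logline.avsc

  agentA.channels.ch1.type = memory

  # Kite DatasetSink (experimental) writing into an existing Avro dataset
  agentA.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
  agentA.sinks.k1.channel = ch1
  agentA.sinks.k1.kite.repo.uri = repo:hdfs://namenode/data
  agentA.sinks.k1.kite.dataset.name = loglines

The morphline file referenced above is where the actual line-to-Avro parsing logic lives.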
In particular, check out the example of parsing syslog lines[3]. The documentation for configuring a Morphline Interceptor is available on the Flume docs site[4]. You also have to use the Static Interceptor[5] to add the Avro schema to the Flume event header.

Also, you might consider using the Kite DatasetSink[6] rather than the default Avro serializer. Don't let the "experimental" label scare you; it's quite stable when writing to an Avro dataset (the default), and it will have support for Parquet in the next release of Flume. There is a fairly complete example of this in the Kite Examples project[7], though with JSON as the source and using the regular HDFS sink.

I've been working to update the Kite examples to use the DatasetSink[8], but I haven't gotten around to the JSON example yet. Let me know if you think it would be useful and I'll post an update in the same branch.
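In case it helps, the morphline that the interceptor points at could look roughly like this. The grok pattern, field names, and file paths are made up for illustration (and I'm going from memory on the exact commands), so treat it as a sketch and look at the syslog example[3] for the real thing:

  morphlines : [
    {
      id : convertLogToAvro
      importCommands : [ "org.kitesdk.**" ]
      commands : [
        # turn the raw event body into a "message" field
        { readLine { charset : UTF-8 } }

        # parse the line into date/priority/subsystem/message fields
        { grok {
            dictionaryFiles : [ /etc/flume/conf/grok-dictionaries ]
            expressions : {
              message : """%{TIMESTAMP_ISO8601:date} %{LOGLEVEL:priority} %{WORD:subsystem} %{GREEDYDATA:message}"""
            }
          }
        }

        # build an Avro record that matches the target schema
        { toAvro { schemaFile : /etc/flume/conf/logline.avsc } }

        # serialize the record back into the event body
        { writeAvroToByteArray { format : containerlessBinary } }
      ]
    }
  ]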
Cheers,

-Joey

[1] http://kitesdk.org/docs/current/kite-morphlines/index.html
[2] http://kitesdk.org/docs/current/
[3] http://kitesdk.org/docs/current/kite-morphlines/index.html#/SyslogUseCase-Requirements
[4] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
[5] http://flume.apache.org/FlumeUserGuide.html#static-interceptor
[6] http://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink-experimental
[7] https://github.com/kite-sdk/kite-examples/tree/master/json
[8] https://github.com/joey/kite-examples/tree/cdk-647-datasetsink-flume-log4j-appender

On Mon, Sep 8, 2014 at 12:00 PM, Ed Judge <ejud...@gmail.com> wrote:
> Thanks for the reply. My understanding of the current avro sink/source is
> that the schema is just a simple header/body schema. What I ultimately
> want to do is read a log file and write it into an avro file on HDFS at
> sink B. However, before that I would like to parse the log file at source
> A first and come up with a more detailed schema which might include, for
> example, date, priority, subsystem, and log message fields. I am trying to
> "normalize" the events so that we could also potentially filter certain log
> file fields at the source. Does this make sense?
>
> On Sep 6, 2014, at 11:52 AM, Ashish <paliwalash...@gmail.com> wrote:
>
> I am not sure I understand the question correctly, so let me try to answer
> based on my understanding.
>
> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>
> For this scenario, Sink A has to be an Avro sink and Source B has to be an
> Avro Source for the flow to work.
> Flume would use Avro for RPC (look at
> flume-ng-sdk/src/main/avro/flume.avdl). It defines how Flume would send
> Event(s) across using Avro RPC.
>
> 1. Source A (spooled dir) would read files, create Events from them, and
> insert them into the channel
> 2. Sink A (Avro sink) would read from the channel and translate each Event
> into an AvroFlumeEvent for sending to Source B (Avro Source)
> 3. Source B would read the AvroFlumeEvent, create an Event, and insert it
> into its channel, where it would be processed by Sink B
>
> It's not that events would be wrapped into events as they traverse down the
> chain. Avro encoding would just exist between Sink A and Source B.
>
> Based on my understanding, you are looking at encoding log file lines
> using Avro. In that case, the Avro-encoded log file lines would be part of
> the Event body; the rest would be the same as steps 1-3.
>
> HTH !
>
> On Fri, Sep 5, 2014 at 12:58 AM, Ed Judge <ejud...@gmail.com> wrote:
>
>> Ok, I have looked over the source and it is making a little more sense.
>>
>> I think what I ultimately want to do is this:
>>
>> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>>
>> source A will be looking at a log file. Each line of the log file will
>> have a certain format/schema. I would write Source A such that it could
>> write the schema/line as an event into the channel and pass that through
>> the system all the way to sink B so that it would know the schema also.
>> I was thinking Avro would be a good format for source A to use when
>> writing into its channel. If Sink A is an existing Avro Sink and Source B
>> is an existing Avro source, would this still work? Does this mean I would
>> have 2 Avro headers (one encapsulating the other), which is wasteful, or
>> can the existing Avro source and sink deal with this unmodified? Is there
>> a better way to accomplish what I want to do? Just looking for some guidance.
>>
>> Thanks,
>> Ed
>>
>> On Sep 4, 2014, at 4:44 AM, Ashish <paliwalash...@gmail.com> wrote:
>>
>> Avro records shall have the schema embedded with them. Have a look at the
>> source, that shall help a bit.
>>
>> On Wed, Sep 3, 2014 at 10:30 PM, Ed Judge <ejud...@gmail.com> wrote:
>>
>>> That's helpful, but isn't there some type of Avro schema negotiation that
>>> occurs?
>>>
>>> -Ed
>>>
>>> On Sep 3, 2014, at 12:02 AM, Jeff Lord <jl...@cloudera.com> wrote:
>>>
>>> Ed,
>>>
>>> Did you take a look at the javadoc in the source?
>>> Basically the source uses Netty as a server and the sink is just an RPC
>>> client.
>>> If you read over the doc in the two links below, take a look at the
>>> developer guide, and still have questions, just ask away and someone
>>> will help to answer.
>>>
>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/AvroSource.java
>>>
>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java
>>>
>>> https://flume.apache.org/FlumeDeveloperGuide.html#transaction-interface
>>>
>>> -Jeff
>>>
>>> On Tue, Sep 2, 2014 at 6:36 PM, Ed Judge <ejud...@gmail.com> wrote:
>>>
>>>> Does anyone know of any good documentation that talks about the
>>>> protocol/negotiation used between an Avro source and sink?
>>>>
>>>> Thanks,
>>>> Ed
>>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal

--
Joey Echeverria