Joey,

Thank you very much for the info. Let me look this over and digest it.

Thanks,
Ed

Sent from my iPhone
> On Sep 8, 2014, at 4:27 PM, Joey Echeverria <j...@cloudera.com> wrote:
>
> Hi Ed!
>
> This is definitely doable. What you want is an interceptor on source A that
> will do the conversion from log lines to Avro. The easiest way to do this
> would probably be to use Morphlines[1] from the Kite SDK[2]. This will let
> you put all of your transformation logic into a configuration file that's
> easy to change at runtime.
>
> In particular, check out the example of parsing syslog lines[3]. The
> documentation for configuring a Morphline Interceptor is available on the
> Flume docs site[4]. You also have to use the Static Interceptor[5] to add
> the Avro schema to the Flume event header.
>
> Also, you might consider using the Kite DatasetSink[6] rather than the
> default Avro serializer. Don't let the "experimental" label scare you; it's
> quite stable when writing to an Avro dataset (the default), and it will
> have support for Parquet in the next release of Flume.
>
> There is a rather complete example of this available in the Kite Examples
> project[7], though with JSON as the source and using the regular HDFS sink.
>
> I've been working to update the Kite examples to use the DatasetSink[8],
> but I haven't gotten around to the JSON example yet. Let me know if you
> think it will be useful and I'll post an update in the same branch.
>
> Cheers,
>
> -Joey
>
> [1] http://kitesdk.org/docs/current/kite-morphlines/index.html
> [2] http://kitesdk.org/docs/current/
> [3] http://kitesdk.org/docs/current/kite-morphlines/index.html#/SyslogUseCase-Requirements
> [4] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [5] http://flume.apache.org/FlumeUserGuide.html#static-interceptor
> [6] http://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink-experimental
> [7] https://github.com/kite-sdk/kite-examples/tree/master/json
> [8] https://github.com/joey/kite-examples/tree/cdk-647-datasetsink-flume-log4j-appender
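(To make the setup Joey describes concrete, here is a minimal sketch of the
source-side agent configuration. The property names follow the Flume user
guide and Kite docs linked above; the agent and component names, file paths,
morphline id, and schema URL are illustrative assumptions, not anything from
this thread.)

    # agent1: spooldir source -> morphline + static interceptors -> Kite DatasetSink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = snk1

    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/log/incoming    # illustrative path
    agent1.sources.src1.channels = ch1

    # i1 runs the morphline that parses each log line and re-encodes it as Avro
    agent1.sources.src1.interceptors = i1 i2
    agent1.sources.src1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
    agent1.sources.src1.interceptors.i1.morphlineFile = /etc/flume-ng/conf/morphline.conf
    agent1.sources.src1.interceptors.i1.morphlineId = convertLogToAvro

    # i2 adds the schema header that Avro-aware serializers/sinks look for
    agent1.sources.src1.interceptors.i2.type = static
    agent1.sources.src1.interceptors.i2.key = flume.avro.schema.url
    agent1.sources.src1.interceptors.i2.value = hdfs://namenode/schemas/logline.avsc

    agent1.channels.ch1.type = memory

    # Kite DatasetSink, marked experimental in the Flume docs
    agent1.sinks.snk1.type = org.apache.flume.sink.kite.DatasetSink
    agent1.sinks.snk1.channel = ch1
    agent1.sinks.snk1.kite.repo.uri = repo:hdfs://namenode/data    # illustrative
    agent1.sinks.snk1.kite.dataset.name = loglines

The morphline file it references might, following the syslog example in [3],
look roughly like this; the grok pattern, dictionary path, and schema file are
assumptions chosen to match the date/priority/subsystem/message fields Ed
mentions:

    morphlines : [
      {
        id : convertLogToAvro
        importCommands : ["org.kitesdk.**"]
        commands : [
          # read each event body as one line of text into the "message" field
          { readLine { charset : UTF-8 } }
          {
            grok {
              dictionaryFiles : [/etc/flume-ng/conf/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:priority}>%{SYSLOGTIMESTAMP:date} %{DATA:subsystem}: %{GREEDYDATA:log_message}"""
              }
            }
          }
          # re-encode the parsed record as Avro so the sink can write it as-is
          { toAvro { schemaFile : /etc/flume-ng/conf/logline.avsc } }
          { writeAvroToByteArray { format : containerlessBinary } }
        ]
      }
    ]

Note that the static interceptor only attaches a header; it is the morphline's
toAvro/writeAvroToByteArray steps that actually produce the Avro-encoded event
body.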
>> On Mon, Sep 8, 2014 at 12:00 PM, Ed Judge <ejud...@gmail.com> wrote:
>> Thanks for the reply. My understanding of the current Avro sink/source is
>> that the schema is just a simple header/body schema. What I ultimately
>> want to do is read a log file and write it into an Avro file on HDFS at
>> sink B. However, before that I would like to parse the log file at source
>> A first and come up with a more detailed schema, which might include, for
>> example, date, priority, subsystem, and log message fields. I am trying
>> to "normalize" the events so that we could also potentially filter
>> certain log file fields at the source. Does this make sense?
>>
>>> On Sep 6, 2014, at 11:52 AM, Ashish <paliwalash...@gmail.com> wrote:
>>>
>>> I am not sure I understand the question correctly; let me try to answer
>>> based on my understanding.
>>>
>>> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>>>
>>> For this scenario, Sink A has to be an Avro sink and Source B has to be
>>> an Avro source for the flow to work. Flume would use Avro for RPC (look
>>> at flume-ng-sdk/src/main/avro/flume.avdl). It defines how Flume would
>>> send Event(s) across using Avro RPC.
>>>
>>> 1. Source A (spooled dir) would read files, create Events from them,
>>>    and insert them into the channel.
>>> 2. Sink A (Avro sink) would read from the channel and translate each
>>>    Event into an AvroFlumeEvent for sending to Source B (Avro source).
>>> 3. Source B would read the AvroFlumeEvent, create an Event, and insert
>>>    it into its channel, where it would be processed by Sink B.
>>>
>>> It's not that the event gets wrapped in another event as it traverses
>>> the chain; the Avro encoding only exists between Sink A and Source B.
>>>
>>> Based on my understanding, you are looking at encoding log file lines
>>> using Avro. In that case, the Avro-encoded log lines would be part of
>>> the Event body; the rest would be the same as steps 1-3.
>>>
>>> HTH!
>>>
>>>> On Fri, Sep 5, 2014 at 12:58 AM, Ed Judge <ejud...@gmail.com> wrote:
>>>> Ok, I have looked over the source and it is making a little more sense.
>>>>
>>>> I think what I ultimately want to do is this:
>>>>
>>>> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>>>>
>>>> Source A will be looking at a log file. Each line of the log file will
>>>> have a certain format/schema. I would write Source A such that it could
>>>> write the schema/line as an event into the channel and pass that
>>>> through the system all the way to sink B so that it would know the
>>>> schema also. I was thinking Avro would be a good format for source A
>>>> to use when writing into its channel. If Sink A is an existing Avro
>>>> sink and Source B is an existing Avro source, would this still work?
>>>> Does this mean I would have 2 Avro headers (one encapsulating the
>>>> other), which would be wasteful, or can the existing Avro source and
>>>> sink deal with this unmodified? Is there a better way to accomplish
>>>> what I want to do? Just looking for some guidance.
>>>>
>>>> Thanks,
>>>> Ed
>>>>
>>>>> On Sep 4, 2014, at 4:44 AM, Ashish <paliwalash...@gmail.com> wrote:
>>>>>
>>>>> Avro records shall have the schema embedded with them. Have a look at
>>>>> the source; that shall help a bit.
>>>>>
>>>>>> On Wed, Sep 3, 2014 at 10:30 PM, Ed Judge <ejud...@gmail.com> wrote:
>>>>>> That's helpful, but isn't there some type of Avro schema negotiation
>>>>>> that occurs?
>>>>>>
>>>>>> -Ed
>>>>>>
>>>>>>> On Sep 3, 2014, at 12:02 AM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>>>>
>>>>>>> Ed,
>>>>>>>
>>>>>>> Did you take a look at the javadoc in the source? Basically the
>>>>>>> source uses Netty as a server and the sink is just an RPC client.
>>>>>>> If you read over the docs in the two links below, take a look at
>>>>>>> the developer guide, and still have questions, just ask away and
>>>>>>> someone will help answer them.
>>>>>>>
>>>>>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/AvroSource.java
>>>>>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java
>>>>>>> https://flume.apache.org/FlumeDeveloperGuide.html#transaction-interface
>>>>>>>
>>>>>>> -Jeff
>>>>>>>
>>>>>>>> On Tue, Sep 2, 2014 at 6:36 PM, Ed Judge <ejud...@gmail.com> wrote:
>>>>>>>> Does anyone know of any good documentation that talks about the
>>>>>>>> protocol/negotiation used between an Avro source and sink?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ed
>>>>>
>>>>> --
>>>>> thanks
>>>>> ashish
>>>>>
>>>>> Blog: http://www.ashishpaliwal.com/blog
>>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>>
>>> --
>>> thanks
>>> ashish
>>>
>>> Blog: http://www.ashishpaliwal.com/blog
>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
> --
> Joey Echeverria
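(For reference, a minimal sketch of the two-agent topology from Ashish's
steps 1-3 above. Agent and component names, hosts, ports, and paths are
illustrative assumptions; the property names follow the Flume user guide.)

    # Agent A: spooling-directory source -> Avro sink (sends AvroFlumeEvents over RPC)
    agentA.sources = srcA
    agentA.channels = chA
    agentA.sinks = snkA

    agentA.sources.srcA.type = spooldir
    agentA.sources.srcA.spoolDir = /var/log/incoming    # illustrative path
    agentA.sources.srcA.channels = chA

    agentA.channels.chA.type = memory

    agentA.sinks.snkA.type = avro
    agentA.sinks.snkA.channel = chA
    agentA.sinks.snkA.hostname = agentb.example.com    # illustrative host
    agentA.sinks.snkA.port = 4141

    # Agent B: Avro source (receives the RPC) -> HDFS sink
    agentB.sources = srcB
    agentB.channels = chB
    agentB.sinks = snkB

    agentB.sources.srcB.type = avro
    agentB.sources.srcB.bind = 0.0.0.0
    agentB.sources.srcB.port = 4141
    agentB.sources.srcB.channels = chB

    agentB.channels.chB.type = memory

    agentB.sinks.snkB.type = hdfs
    agentB.sinks.snkB.channel = chB
    agentB.sinks.snkB.hdfs.path = hdfs://namenode/flume/logs    # illustrative

As Ashish notes, the Avro wire encoding exists only on the hop between snkA
and srcB; whatever event body srcA created arrives at snkB unchanged.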
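(And a sketch of the kind of "more detailed schema" Ed describes, as an Avro
.avsc record. The field names come from his message; the record name,
namespace, and types are assumptions for illustration.)

    {
      "type": "record",
      "name": "LogLine",
      "namespace": "com.example.logs",
      "doc": "Illustrative example schema, not from the thread",
      "fields": [
        {"name": "date",        "type": "string"},
        {"name": "priority",    "type": "string"},
        {"name": "subsystem",   "type": "string"},
        {"name": "log_message", "type": "string"}
      ]
    }

Keeping priority as a string avoids any type coercion when the parsed record
is converted to Avro; making it numeric would take an extra conversion step in
whatever does the parsing. With a schema like this in place, filtering fields
at the source reduces to dropping or masking record fields before the event
body is encoded.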