Joey,
Thank you very much for the info. Let me look this over and digest it. 

Thanks,
Ed. 


> On Sep 8, 2014, at 4:27 PM, Joey Echeverria <j...@cloudera.com> wrote:
> 
> Hi Ed!
> 
> This is definitely doable. What you want is an interceptor on source A 
> that will do the conversion from log lines to Avro. The easiest way to do 
> this would probably be to use Morphlines[1] from the Kite SDK[2]. This 
> will let you put all of your transformation logic into a configuration 
> file that's easy to change at runtime.
> 
> In particular, check out the example of parsing syslog lines[3]. The 
> documentation for configuring a Morphline Interceptor is available on the 
> Flume docs site[4]. You also have to use the Static Interceptor[5] to add 
> the Avro schema to the Flume event header.
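> 
> For example (untested; the agent, source, and interceptor names plus the 
> schema path are all illustrative), the wiring in the agent config might 
> look roughly like this:
> 
>   # run the morphline first, then stamp the schema URL into the header
>   a1.sources.src1.interceptors = morphline schema
>   a1.sources.src1.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
>   a1.sources.src1.interceptors.morphline.morphlineFile = /etc/flume-ng/conf/morphline.conf
>   a1.sources.src1.interceptors.morphline.morphlineId = morphline1
>   a1.sources.src1.interceptors.schema.type = static
>   a1.sources.src1.interceptors.schema.key = flume.avro.schema.url
>   a1.sources.src1.interceptors.schema.value = hdfs://namenode/schemas/mylog.avsc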
> 
> Also, you might consider using the Kite DatasetSink[6] rather than the 
> default Avro serializer. Don't let the "experimental" label scare you; 
> it's quite stable when writing to an Avro dataset (the default), and it 
> will have support for Parquet in the next release of Flume.
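> 
> A minimal sketch of the sink config (the repo URI and dataset name are 
> made up; check the Flume user guide[6] for the exact property names):
> 
>   # write events straight into a Kite dataset instead of raw HDFS files
>   a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
>   a1.sinks.k1.channel = ch1
>   a1.sinks.k1.kite.repo.uri = repo:hdfs://namenode:8020/data
>   a1.sinks.k1.kite.dataset.name = logs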
> 
> There is a rather complete example of this available in the Kite Examples 
> project[7], though with JSON as the source and using the regular HDFS sink.
> 
> I've been working to update the Kite examples to use the DatasetSink[8], but 
> I haven't gotten around to the JSON example yet. Let me know if you think it 
> will be useful and I'll post an update in the same branch.
> 
> Cheers,
> 
> -Joey
> 
> 
> [1] http://kitesdk.org/docs/current/kite-morphlines/index.html
> [2] http://kitesdk.org/docs/current/
> [3] http://kitesdk.org/docs/current/kite-morphlines/index.html#/SyslogUseCase-Requirements
> [4] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [5] http://flume.apache.org/FlumeUserGuide.html#static-interceptor
> [6] http://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink-experimental
> [7] https://github.com/kite-sdk/kite-examples/tree/master/json
> [8] https://github.com/joey/kite-examples/tree/cdk-647-datasetsink-flume-log4j-appender
> 
>> On Mon, Sep 8, 2014 at 12:00 PM, Ed Judge <ejud...@gmail.com> wrote:
>> Thanks for the reply. My understanding of the current Avro sink/source is 
>> that the schema is just a simple header/body schema. What I ultimately 
>> want to do is read a log file and write it into an Avro file on HDFS at 
>> sink B. However, before that I would like to parse the log file at source 
>> A first and come up with a more detailed schema, which might include, for 
>> example, date, priority, subsystem, and log message fields. I am trying 
>> to "normalize" the events so that we could also potentially filter 
>> certain log file fields at the source. Does this make sense?
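>> 
>> In other words, instead of the generic header/body pair, each event 
>> would follow a record schema along these lines (field names and types 
>> are just a rough sketch):
>> 
>>   {
>>     "type": "record",
>>     "name": "LogEvent",
>>     "fields": [
>>       {"name": "date", "type": "string"},
>>       {"name": "priority", "type": "string"},
>>       {"name": "subsystem", "type": "string"},
>>       {"name": "message", "type": "string"}
>>     ]
>>   }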
>> 
>>> On Sep 6, 2014, at 11:52 AM, Ashish <paliwalash...@gmail.com> wrote:
>>> 
>>> I am not sure I understand the question correctly, so let me try to 
>>> answer based on my understanding.
>>> 
>>> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>>> 
>>> For this scenario, Sink A has to be an Avro sink and Source B has to be 
>>> an Avro source for the flow to work. Flume uses Avro for RPC (look at 
>>> flume-ng-sdk/src/main/avro/flume.avdl). It defines how Flume sends 
>>> Event(s) across using Avro RPC.
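>>> 
>>> From memory, the relevant record in flume.avdl is roughly this (check 
>>> the file itself for the authoritative definition):
>>> 
>>>   record AvroFlumeEvent {
>>>     map<string> headers;  // string-keyed, string-valued headers
>>>     bytes body;           // opaque event payload
>>>   }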
>>> 
>>> 1. Source A (spooling directory) reads files, creates Events from them, 
>>> and inserts them into the channel
>>> 2. Sink A (Avro sink) reads from the channel and translates each Event 
>>> into an AvroFlumeEvent for sending to Source B (Avro source)
>>> 3. Source B reads the AvroFlumeEvent, creates an Event, and inserts it 
>>> into its channel, where it is processed by Sink B
>>> 
>>> It's not that the event gets wrapped in additional events as it 
>>> traverses the chain. Avro encoding exists only between Sink A and 
>>> Source B.
>>> 
>>> Based on my understanding, you are looking at encoding log file lines 
>>> using Avro. In that case, the Avro-encoded log lines would be part of 
>>> the Event body; the rest would be the same as steps 1-3.
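>>> 
>>> A bare-bones sketch of the Sink A / Source B pairing (agent names, 
>>> hosts, and ports are made up):
>>> 
>>>   # agent A: Avro sink pointing at agent B
>>>   agentA.sinks.sinkA.type = avro
>>>   agentA.sinks.sinkA.channel = chA
>>>   agentA.sinks.sinkA.hostname = hostB
>>>   agentA.sinks.sinkA.port = 4141
>>> 
>>>   # agent B: Avro source listening on the same port
>>>   agentB.sources.sourceB.type = avro
>>>   agentB.sources.sourceB.channels = chB
>>>   agentB.sources.sourceB.bind = 0.0.0.0
>>>   agentB.sources.sourceB.port = 4141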
>>> 
>>> HTH!
>>> 
>>> 
>>>> On Fri, Sep 5, 2014 at 12:58 AM, Ed Judge <ejud...@gmail.com> wrote:
>>>> Ok, I have looked over the source and it is making a little more sense.
>>>> 
>>>> I think what I ultimately want to do is this:
>>>> 
>>>> source A -> channel A -> sink A ———> source B -> channel B -> sink B
>>>> 
>>>> Source A will be looking at a log file. Each line of the log file will 
>>>> have a certain format/schema. I would write Source A such that it could 
>>>> write the schema/line as an event into the channel and pass that 
>>>> through the system all the way to sink B, so that it would know the 
>>>> schema also.
>>>> I was thinking Avro would be a good format for source A to use when 
>>>> writing into its channel. If Sink A is an existing Avro sink and Source 
>>>> B is an existing Avro source, would this still work? Does this mean I 
>>>> would have two Avro headers (one encapsulating the other), which is 
>>>> wasteful, or can the existing Avro source and sink deal with this 
>>>> unmodified? Is there a better way to accomplish what I want to do? 
>>>> Just looking for some guidance.
>>>> 
>>>> Thanks,
>>>> Ed
>>>> 
>>>>> On Sep 4, 2014, at 4:44 AM, Ashish <paliwalash...@gmail.com> wrote:
>>>>> 
>>>>> Avro records have the schema embedded with them. Have a look at the 
>>>>> source; that should help a bit.
>>>>> 
>>>>> 
>>>>>> On Wed, Sep 3, 2014 at 10:30 PM, Ed Judge <ejud...@gmail.com> wrote:
>>>>>> That’s helpful, but isn’t there some type of Avro schema negotiation 
>>>>>> that occurs?
>>>>>> 
>>>>>> -Ed
>>>>>> 
>>>>>>> On Sep 3, 2014, at 12:02 AM, Jeff Lord <jl...@cloudera.com> wrote:
>>>>>>> 
>>>>>>> Ed,
>>>>>>> 
>>>>>>> Did you take a look at the javadoc in the source?
>>>>>>> Basically, the source uses Netty as a server, and the sink is just 
>>>>>>> an RPC client.
>>>>>>> If you read over the docs in the two links below, take a look at the 
>>>>>>> developer guide, and still have questions, just ask away and someone 
>>>>>>> will help to answer.
>>>>>>> 
>>>>>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/AvroSource.java
>>>>>>> 
>>>>>>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java
>>>>>>> 
>>>>>>> https://flume.apache.org/FlumeDeveloperGuide.html#transaction-interface
>>>>>>> 
>>>>>>> -Jeff
>>>>>>> 
>>>>>>>> On Tue, Sep 2, 2014 at 6:36 PM, Ed Judge <ejud...@gmail.com> wrote:
>>>>>>>> Does anyone know of any good documentation that talks about the 
>>>>>>>> protocol/negotiation used between an Avro source and sink?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ed
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> thanks
>>>>> ashish
>>>>> 
>>>>> Blog: http://www.ashishpaliwal.com/blog
>>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>> 
>>> 
>>> 
>>> -- 
>>> thanks
>>> ashish
>>> 
>>> Blog: http://www.ashishpaliwal.com/blog
>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
> 
> 
> 
> -- 
> Joey Echeverria
