Hi!

I'm planning to use Hive to query custom Avro logging records. I
transfer the data to HDFS via Flume and pick it up from there.

The Flume event schema is
{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}
which means that my custom records appear to Hive only as the binary
"body" field of that wrapper record.
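
Concretely, I expect the raw table in Hive to look roughly like this
(table name and HDFS location are just placeholders):

  -- Raw table over the Flume output, using the wrapper Event schema
  CREATE EXTERNAL TABLE flume_events
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/flume/events'
  TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}');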

There are three ways that I see for querying my custom record fields:

(1) Store my custom record fields as Flume event headers. Then Hive
can query them as-is, out of the box.
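
For example, with (1) something like this would work directly against
the table above (the header names are just made-up examples):

  SELECT headers['application'],
         headers['loglevel'],
         headers['message']
  FROM flume_events
  WHERE headers['loglevel'] = 'ERROR';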

(2) Use a different Flume event serializer that writes my own schema
to HDFS directly, instead of the Flume event schema. This requires a
custom Flume setup.
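
As far as I can tell, that would mean plugging a custom
EventSerializer.Builder into the HDFS sink, roughly like this (the
class name is a placeholder for a serializer I would still have to
write):

  agent.sinks.hdfs-sink.type = hdfs
  agent.sinks.hdfs-sink.hdfs.path = /flume/events
  # DataStream, so the serializer controls the on-disk format
  agent.sinks.hdfs-sink.hdfs.fileType = DataStream
  # hypothetical serializer that writes my record schema instead of
  # the wrapper Event schema
  agent.sinks.hdfs-sink.serializer = com.example.MyRecordAvroSerializer$Builder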

(3) Modify data on import to Hive, or create a view on the data once
it is in Hive.
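
For (3) I picture something like a view on top of a UDF that
deserializes the body bytes; my_record_field below is purely
hypothetical, I would have to write it myself:

  CREATE VIEW my_records AS
  SELECT my_record_field(body, 'loglevel') AS loglevel,
         my_record_field(body, 'message')  AS message
  FROM flume_events;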

Do you have any suggestions for how to go about this, or which route
is preferable?

Best regards,
Manuel

-- 
Manuel Simoni, Engineering Consultant
msim...@gmail.com | Tel: +43 (0)664 346 5158 | Skype: manuelsimoni
