I am working on a Hive SerDe where both the SerDe and the RecordReader need
access to an external resource with information.
This external resource could be on HDFS, in HBase, or on an HTTP server.
This situation is very similar to what haivvreo does.

The way I go about it right now is that I store the URI for the external
resource in the SERDEPROPERTIES, and then both the SerDe and the RecordReader
use that to load the resource. I had to jump through some hoops to retrieve
the Properties object (the SERDEPROPERTIES) in the RecordReader, but now it
works. However, this is far from optimal, since on a large cluster it
leads to a lot of read requests against the external resource.
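
A minimal sketch of that lookup, for illustration only. The property name
"external.resource.uri" and the "hbase" URI scheme are assumptions made for
this example, and the class is a stand-in rather than the actual SerDe code:

```java
import java.net.URI;
import java.util.Properties;

class ExternalResourceLocator {
    // Hypothetical SERDEPROPERTIES key; the real key name is not given above.
    static final String URI_KEY = "external.resource.uri";

    // Reads the resource URI out of the table's Properties object.
    static URI resourceUri(Properties tableProps) {
        String raw = tableProps.getProperty(URI_KEY);
        if (raw == null) {
            throw new IllegalArgumentException("Missing property: " + URI_KEY);
        }
        return URI.create(raw);
    }

    // Decides which kind of backend to read from, based on the URI scheme.
    static String backendFor(Properties tableProps) {
        String scheme = resourceUri(tableProps).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme");
        }
        switch (scheme) {
            case "hdfs":
                return "hdfs";
            case "hbase": // assumed scheme, used here only to tell backends apart
                return "hbase";
            case "http":
            case "https":
                return "http";
            default:
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
    }
}
```

Every SerDe and RecordReader instance that runs this lookup and then reads
from the resolved location adds one more read against the external resource.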

Since the SerDe gets called at least once on the client before the MapReduce
job is started, I would like to load my external resource there, and then
stuff it in the Configuration object, the Properties object, or the
Distributed Cache. Then the SerDes and RecordReaders on the cluster could
get it from there and wouldn't have to access the external resource.
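
As a rough sketch of the "stuff it in" step: Configuration and Properties
both hold string values, so a binary resource would need to be encoded first.
The example below uses Base64 and a plain Map as a stand-in for either
object. The key name "external.resource.payload" is hypothetical, and the
sketch only covers encoding and decoding, not how the value travels from the
client to the cluster:

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class ResourceCarrier {
    // Hypothetical key under which the encoded resource is stored.
    static final String PAYLOAD_KEY = "external.resource.payload";

    // Client side: encode the loaded resource as a string value.
    static void store(Map<String, String> settings, byte[] resource) {
        settings.put(PAYLOAD_KEY, Base64.getEncoder().encodeToString(resource));
    }

    // Cluster side: decode the resource again instead of re-reading it.
    static byte[] load(Map<String, String> settings) {
        String encoded = settings.get(PAYLOAD_KEY);
        if (encoded == null) {
            throw new IllegalStateException("No payload under " + PAYLOAD_KEY);
        }
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // The Map stands in for a Configuration or Properties object.
        Map<String, String> settings = new HashMap<>();
        store(settings, "example resource".getBytes());
        System.out.println(new String(load(settings)));
    }
}
```

This only makes sense for small resources; a large one would be a better fit
for the Distributed Cache than for a string-valued setting.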

I made the changes, but whatever modification I make to the Configuration
object or the Properties object on the client in the SerDe doesn't make it
to the cluster! Is there a way to do this?
