I am working on a Hive SerDe in which both the SerDe and the RecordReader need access to an external resource. This resource could live on HDFS, in HBase, or on an HTTP server. The situation is very similar to what haivvreo does.
Right now I store the URI of the external resource in the SERDEPROPERTIES, and both the SerDe and the RecordReader use that URI to load the resource. I had to jump through some hoops to retrieve the Properties object (the SERDEPROPERTIES) in the RecordReader, but it works now. However, this is far from optimal: on a large cluster every task loads the resource itself, which leads to a lot of read requests against the external resource.
Since the SerDe gets called at least once on the client before the MapReduce job is started, I would like to load the external resource there and then stuff it into the Configuration object, the Properties object, or the Distributed Cache. The SerDes and RecordReaders on the cluster could then get it from there and would not have to access the external resource at all. I made those changes, but whatever modifications I make to the Configuration object or the Properties object on the client in the SerDe never make it to the cluster. Is there a way to do this?
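To make the intended hand-off concrete, here is a minimal sketch of the "load once on the client, read back on the cluster" idea, using only a plain `java.util.Properties` object. The class name `ResourceHandoff` and the key `my.serde.resource.payload` are hypothetical, chosen just for illustration. The sketch deliberately contains no Hive or Hadoop calls: it only shows how the resource bytes could be packed into a string property and unpacked again, and it does not answer whether Hive ships a client-side change to the tasks, which is exactly the open question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;

public class ResourceHandoff {
    // Hypothetical property key; not a real Hive or Hadoop setting.
    static final String PAYLOAD_KEY = "my.serde.resource.payload";

    // Client side: store the already-loaded resource as a Base64 string,
    // since property values must be text.
    static void stash(Properties props, byte[] resource) {
        props.setProperty(PAYLOAD_KEY, Base64.getEncoder().encodeToString(resource));
    }

    // Task side: return the resource bytes, or null if the property is
    // missing (the caller would then fall back to loading from the URI).
    static byte[] fetch(Properties props) {
        String encoded = props.getProperty(PAYLOAD_KEY);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        stash(props, "example resource contents".getBytes(StandardCharsets.UTF_8));
        byte[] restored = fetch(props);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Returning `null` when the property is absent keeps the current URI-based loading available as a fallback, so the same code path would still work if the payload does not reach the cluster.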