All,

We have a small table that we map-join to several large tables, each in a separate Hive query script.
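For concreteness, each script contains a join of roughly the following shape. This is only a sketch: it assumes the hint form of the map-join, and the table and column names (small_a, big_b, big_c, id, label, amount, quantity) are placeholders rather than our real schema.

```sql
-- Script 1: map-join the small table to the first large table.
-- All table and column names are hypothetical placeholders.
SELECT /*+ MAPJOIN(a) */ a.id, a.label, b.amount
FROM small_a a
JOIN big_b b ON (a.id = b.id);

-- Script 2, run later as a separate Hive invocation:
-- the same small table, map-joined to a different large table.
SELECT /*+ MAPJOIN(a) */ a.id, a.label, c.quantity
FROM small_a a
JOIN big_c c ON (a.id = c.id);
```

Depending on the Hive version and settings, the hint may be ignored in favour of automatic join conversion, so treat the hint here as illustrative only.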
As I understand it, the map-join does some preparatory work to get the small table into the distributed cache. The steps, as far as I understand them, are:

1) Read the small table into a hash table
2) Serialize the hash table to a file on disk
3) Tar the file
4) Put the tar in the distributed cache

Now assume that we have small table A, and we map-join it to B and later map-join it to C. I presume that Hive will repeat the above preparatory steps for each map-join. After all, it can't know that table A has not changed between scripts.

Questions:

a) Does this work get repeated with each map-join?
b) Is there any way to tell Hive to re-use previous map-join small tables that are already cached?
c) Is there any way to do something like this manually: put the file into the distributed cache ourselves and get the map-join to re-use it?
d) How does Hive manage such distributed cache objects? Are they cleaned out at the end of the Hive query?

Thanks,
Mark