All,

We have a small table that we join to several large tables, each in a separate 
Hive query script, using the map-join technique.

As I understand it, a map-join first does some preparatory work to get the small 
table into the distributed cache.  The steps, as far as I can tell, are:


1) Read the small table into a hash table

2) Serialize to a file on disk

3) Tar the file

4) Put the tar in the distributed cache.
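
To make the steps concrete, here is a rough Python sketch of steps 1 to 3 as I picture them. It is only an illustration of my understanding, not Hive's actual code: the function names, the file names, and the use of pickle and gzip are all my own assumptions, and step 4 (the distributed cache itself) is not modeled at all.

```python
import os
import pickle
import tarfile
import tempfile


def build_hash_table(rows, key_index=0):
    """Step 1: load the small table into a hash table keyed on the join column."""
    table = {}
    for row in rows:
        table.setdefault(row[key_index], []).append(row)
    return table


def serialize_and_tar(table, work_dir):
    """Steps 2 and 3: dump the hash table to a file, then wrap the file in a tar."""
    # File names are hypothetical; pickle stands in for whatever format Hive uses.
    dump_path = os.path.join(work_dir, "small_table.hashtable")
    with open(dump_path, "wb") as f:
        pickle.dump(table, f)
    tar_path = os.path.join(work_dir, "small_table.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(dump_path, arcname=os.path.basename(dump_path))
    return tar_path


def load_from_tar(tar_path):
    """What a mapper would do with the localized archive: unpack and reload."""
    with tarfile.open(tar_path, "r:gz") as tar:
        member = tar.extractfile("small_table.hashtable")
        return pickle.load(member)


def round_trip(rows):
    """Run steps 1 to 3 in a scratch directory and read the result back."""
    with tempfile.TemporaryDirectory() as work_dir:
        tar_path = serialize_and_tar(build_hash_table(rows), work_dir)
        # Step 4 (shipping tar_path via the distributed cache) is not modeled.
        return load_from_tar(tar_path)
```

The point of the sketch is that everything up to the tar file depends only on the contents of the small table, which is why it looks reusable across joins.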

Now assume that we have small table A and we map-join it to B and later 
map-join it to C.

I presume that Hive will repeat the above preparatory steps for each map-join.  
After all, it can't know that table A has not changed between scripts.
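
The kind of re-use I have in mind would need some way to detect that table A is unchanged. A minimal Python sketch of the idea follows, assuming a content fingerprint is used as the cache key. To be clear, this is hypothetical: `fingerprint` and `PreparedTableCache` are names I made up, and I am not claiming Hive does anything like this.

```python
import hashlib


def fingerprint(rows):
    """Hash the table contents so an unchanged table always yields the same key."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


class PreparedTableCache:
    """Hypothetical cache of prepared small tables, keyed by content fingerprint."""

    def __init__(self):
        self._by_fingerprint = {}
        self.builds = 0  # counts how often the preparatory work actually ran

    def get(self, rows, build):
        key = fingerprint(rows)
        if key not in self._by_fingerprint:
            # Table is new or has changed: redo the preparatory work.
            self._by_fingerprint[key] = build(rows)
            self.builds += 1
        return self._by_fingerprint[key]
```

Note that computing the fingerprint still means reading the small table once per script, so in this sketch the saving would only be in the serialize, tar, and upload steps.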

Questions:


a) Does this work get repeated with each map-join?

b) Is there any way to tell Hive to re-use a small table that an earlier 
map-join has already cached?

c) Is there any way to do something like this by hand, that is, put the file 
into the distributed cache ourselves and get the map-join to re-use it?

d) How does Hive manage such distributed cache objects?  Are they cleaned 
out at the end of the Hive query?

Thanks,
Mark
