I have a MapReduce job that requires expensive initialization (loading some large dictionaries before processing).
I want to avoid running this initialization more often than necessary. I understand that I need to call setNumTasksToExecutePerJvm(-1) to make MapReduce reuse JVMs across tasks. Currently I do the initialization in my mapper: I override MapReduceBase#configure, read my parameters from the JobConf, and load my dictionaries. From the tests I've run, it appears that even with NumTasksToExecutePerJvm set to -1, a new instance of my Mapper class is created for each task, so the expensive initialization still runs once per task. So, my question is: how can I avoid repeating this initialization for every task? Should I move the initialization code out of my mapper class and into my "main" class? If so, how do I pass references to the loaded dictionaries from my main class to my mapper? Thanks!
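To make the behavior concrete, here is a minimal, Hadoop-free sketch of what I believe is happening, assuming the framework constructs a fresh Mapper instance per task even when the JVM is reused. Every name in it (`JvmReuseSketch`, `PerInstanceMapper`, `StaticCacheMapper`, `loadDictionary`, the `dict.path` key) is a hypothetical stand-in and not part of the Hadoop API, and a plain `Map` stands in for the JobConf. It contrasts my current per-instance loading with a lazily loaded static field, which is one possible way to share state between instances in the same JVM; I have not confirmed that this is the recommended approach.

```java
import java.util.HashMap;
import java.util.Map;

public class JvmReuseSketch {

    // Counts how many times the (simulated) expensive load runs.
    static int loadCount = 0;

    // Hypothetical stand-in for loading a large dictionary.
    static Map<String, String> loadDictionary(String path) {
        loadCount++;
        Map<String, String> dict = new HashMap<>();
        dict.put("source", path); // placeholder content only
        return dict;
    }

    // Mirrors the current approach: every instance loads in configure().
    static class PerInstanceMapper {
        private Map<String, String> dictionary;

        void configure(Map<String, String> conf) {
            dictionary = loadDictionary(conf.get("dict.path"));
        }
    }

    // Possible alternative: a static field shared by all instances in this JVM.
    static class StaticCacheMapper {
        private static Map<String, String> dictionary;

        void configure(Map<String, String> conf) {
            // Lock on the class so concurrent instances do not load twice.
            synchronized (StaticCacheMapper.class) {
                if (dictionary == null) {
                    dictionary = loadDictionary(conf.get("dict.path"));
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("dict.path", "/hypothetical/dictionary.txt");

        // Simulate three tasks in one JVM, each with a new mapper instance.
        for (int task = 0; task < 3; task++) {
            new PerInstanceMapper().configure(conf);
        }
        System.out.println("per-instance loads: " + loadCount); // prints 3

        loadCount = 0;
        for (int task = 0; task < 3; task++) {
            new StaticCacheMapper().configure(conf);
        }
        System.out.println("static-cache loads: " + loadCount); // prints 1
    }
}
```

In this sketch the static field only helps for tasks that actually run in the same JVM; tasks in a different JVM would still have to load the dictionaries once themselves.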
