I'm not aware of any documentation about this particular use case for Hadoop. I think your best bet is to look into the JNI documentation on loading native libraries, and go from there.
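
A minimal sketch of what I'd expect that to look like -- untested, and the
library name "processor" and the native methods are placeholders for
whatever your .so actually exposes:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class NativeProcessingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Guard so the expensive load happens at most once per JVM, even when
  // JVM reuse runs several tasks serially in the same process.
  private static boolean initialized = false;

  // Placeholder native entry points implemented in your C++ .so.
  private static native void init();                    // the slow load
  private static native String process(String record);  // per-record work

  public void configure(JobConf job) {
    synchronized (NativeProcessingMapper.class) {
      if (!initialized) {
        System.loadLibrary("processor");  // finds libprocessor.so on
        init();                           // java.library.path via JNI
        initialized = true;
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Only the first task in the JVM pays the load cost; every record
    // after that goes straight to the already-initialized native code.
    output.collect(new Text(key.toString()),
        new Text(process(value.toString())));
  }
}

You would have to get the .so onto each node yourself (or ship it with the
DistributedCache) and make sure java.library.path points at it.

- Aaron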

On Sat, Apr 25, 2009 at 10:44 PM, amit handa <[email protected]> wrote:

> Thanks Aaron,
>
> The processing libs that we use, which take time to load, are all
> C++-based .so libs.
> Can I invoke them from the JVM during the configure stage of the mapper
> and keep them running, as you suggested?
> Can you point me to some documentation regarding this?
>
> Regards,
> Amit
>
> On Sat, Apr 25, 2009 at 1:42 PM, Aaron Kimball <[email protected]> wrote:
>
> > Amit,
> >
> > This can be made to work with Hadoop. Basically, your mapper would do
> > the heavy load-in during its "configure" stage, then process your
> > individual work items as records during the actual "map" stage.
> > A map task can comprise many records, so you'll be fine here.
> >
> > If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
> > multiple map tasks are performed serially in the same JVM instance. In
> > this case, the first task in the JVM would do the heavy load-in into
> > static fields or other globally accessible items; subsequent tasks
> > could recognize that the system state is already initialized and would
> > not need to repeat it.
> >
> > The number of mapper/reducer tasks that run in parallel on a given
> > node can be configured with a simple setting; setting this to 6 will
> > work just fine. The capacity/fair-share schedulers are not what you
> > need here -- their main function is to ensure that multiple jobs
> > (separate sets of tasks) can all make progress simultaneously by
> > sharing cluster resources across jobs, rather than running jobs in a
> > FIFO fashion.
> >
> > - Aaron
> >
> > On Sat, Apr 25, 2009 at 2:36 PM, amit handa <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > We are planning to use Hadoop for some very expensive, long-running
> > > processing tasks.
> > > The computing nodes we plan to use are very heavy in terms of CPU
> > > and memory requirements: one process instance takes almost 100% CPU
> > > (1 core) and around 300-400 MB of RAM.
> > > The first time the process loads, it can take around 1 to 1.5
> > > minutes, but after that we can provide it data and it takes only a
> > > few seconds to process.
> > > Can I model this on Hadoop?
> > > Can I have my processes pre-loaded on the task-processing machines
> > > and the data provided by Hadoop? This would save the 1 to 1.5
> > > minutes of initial load time that would otherwise be spent on each
> > > task.
> > > I want to run a number of these processes in parallel based on each
> > > machine's capacity (e.g. 6 instances on an 8-CPU box) or using the
> > > capacity scheduler.
> > >
> > > Please let me know if this is possible, or share any pointers to
> > > how it can be done.
> > >
> > > Thanks,
> > > Amit
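
P.S. Since the settings came up above: JVM reuse is a per-job knob, while
the number of parallel task slots is per-node. Roughly, against the
0.19/0.20 API (a sketch only; NativeProcessingMapper is the mapper
sketched earlier in this message, and the input/output paths are
placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class NativeProcessingJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NativeProcessingJob.class);
    conf.setJobName("native-processing");

    conf.setMapperClass(NativeProcessingMapper.class);
    conf.setNumReduceTasks(0);              // map-only job
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Let one JVM run an unlimited number of this job's tasks serially
    // (Hadoop 0.19+), so the native library load is paid once per JVM
    // rather than once per task. Equivalent to setting
    // mapred.job.reuse.jvm.num.tasks to -1.
    conf.setNumTasksToExecutePerJvm(-1);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The per-node parallelism (your 6 instances on an 8-CPU box) is the
tasktracker-side setting mapred.tasktracker.map.tasks.maximum; set it to 6
in each node's config (hadoop-site.xml in 0.19, mapred-site.xml in 0.20)
and no more than 6 map tasks will run on that machine at once.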
