Thanks for all your help, Aaron, Alan and Scott!
Kristi
Aaron Kimball wrote:
Hi Kristi :)
JobControl, as I understand it, runs on the client machine and sends Job
objects to the JobTracker as their dependencies are met; at that point they
enter the JobTracker's normal scheduler queue.
There are three different scheduler implementations that you can choose
from. All of them support multiple concurrent jobs when there are more task
slots available than tasks. So if a job contains 50 map tasks, and there are
64 map task slots available (e.g., 8 machines with
mapred.tasktracker.map.tasks.maximum = 8), then the whole job will be
scheduled, and the first 14 tasks from the next job can run simultaneously.
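To make the JobControl side concrete, here is a minimal sketch against the
old org.apache.hadoop.mapred API of this era; the class name, group name,
and job configuration (mappers, input/output paths) are placeholders:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class TwoStepDriver {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    JobConf firstConf = new JobConf();   // mapper/reducer and paths elided
    JobConf secondConf = new JobConf();  // would read the first job's output

    Job first = new Job(firstConf);
    Job second = new Job(secondConf);
    second.addDependingJob(first);       // hold "second" until "first" succeeds

    JobControl control = new JobControl("two-step");
    control.addJob(first);
    control.addJob(second);

    // JobControl implements Runnable: it polls job states on the client and
    // submits each job to the JobTracker once its dependencies have completed.
    Thread driver = new Thread(control);
    driver.start();
    while (!control.allFinished()) {
      Thread.sleep(5000);
    }
    control.stop();
  }
}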
The schedulers differ in what happens when a cluster is already loaded up
with work.
The default scheduler is the FIFO scheduler. Jobs sit in a queue, and all
tasks from job 1 will be scheduled before any tasks from job 2 are
dispatched to TaskTrackers.
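For what it's worth, the scheduler is chosen with the
mapred.jobtracker.taskScheduler property in the JobTracker's configuration;
as I recall, FIFO is simply the default value (double-check this against
your version):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>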
There's another scheduler called the FairScheduler. The user configures a
set of "pools" on the cluster, and each job, when created, is bound to a
specific pool. Each pool is guaranteed a minimum share of the scheduler;
e.g., my pool may guarantee me a minimum of 3 map task slots and 1 reduce
slot. If the whole cluster is loaded up, then the scheduler will move my
tasks up in its dispatch order until I'm getting my 3-task minimum. If
multiple jobs are tagged with different pools, the fair scheduler's
algorithm balances between them. The "aaron" pool may have a 3-task
minimum and the "bob" pool a 6-task minimum, with many more than 9 task
slots available. If aaron and bob both submit jobs at once, then aaron will
get 3/9 of the available task slots and bob will get 6/9. But all jobs from
a given user are usually put in the same pool, so they will run in the same
shared set of resources. Multiple jobs in a pool can still run in parallel
if the pool has slots available.
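As a sketch of how that's wired up (property and file names as I recall
them from the contrib fair scheduler of this era; double-check against your
version's docs), the JobTracker is pointed at the FairScheduler class and
the pool minimums live in an allocations file:

<!-- in mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/pools.xml</value>
</property>

<!-- pools.xml: minimum shares for the pools in the example above -->
<allocations>
  <pool name="aaron">
    <minMaps>3</minMaps>
    <minReduces>1</minReduces>
  </pool>
  <pool name="bob">
    <minMaps>6</minMaps>
  </pool>
</allocations>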
Finally, there's a third scheduler called the Capacity Scheduler. It's
similar to the fair scheduler, in that it allows guarantees of minimum
capacity for different queues (its analogue of pools). I don't know how it
apportions extra resources, though; this is the one I'm least familiar
with. Someone else will have to chime in here.
- Aaron
On Fri, Jun 5, 2009 at 12:19 PM, Scott Carey <[email protected]> wrote:
Even more general context:
Cascading does something similar, but I am not sure if it uses Hadoop's
JobControl or manages dependencies itself. It definitely runs multiple
jobs in parallel when the dependencies allow it.
On 6/5/09 11:44 AM, "Alan Gates" <[email protected]> wrote:
To add a little context, Pig uses Hadoop's JobControl to schedule its
jobs. Pig defines the dependencies between jobs in JobControl, and
then submits the entire graph of jobs. So, using JobControl, does
Hadoop schedule jobs serially or in parallel (assuming no dependencies)?
Alan.
On Jun 5, 2009, at 10:50 AM, Kristi Morton wrote:
Hi Pankil,
Sorry about having to send my question to the list twice... the
first time I sent it, I had forgotten to subscribe to the list.
I resent it after subscribing, and your response to the first email
did not make it into my inbox; I saw it in the list archives.
So, to recap, you said:
"We are not able to carry out all joins in a single job..we also
tried our hadoop code using
Pig scripts and found that for each join in PIG script new job is
used.So
basically what i think its a sequential process to handle typesof
join where
output of one job is required s an input to other one."
I, too, have seen this sequential behavior with joins. However, it
seems like it should be possible for two jobs to execute in parallel
when their outputs are the inputs to a subsequent job. Is this
possible, or are all jobs scheduled sequentially?
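To make the scenario concrete, here is a hypothetical fragment using the
org.apache.hadoop.mapred.jobcontrol classes shown in the sketch near the
top of this thread; leftConf, rightConf, and mergeConf are placeholder
JobConf objects for the two joins and the job that consumes their outputs:

Job leftJoin = new Job(leftConf);
Job rightJoin = new Job(rightConf);
Job merge = new Job(mergeConf);
merge.addDependingJob(leftJoin);   // merge waits on both joins,
merge.addDependingJob(rightJoin);  // but the joins don't wait on each other

JobControl control = new JobControl("diamond");
control.addJob(leftJoin);
control.addJob(rightJoin);
control.addJob(merge);
// With no dependency between them, leftJoin and rightJoin are both eligible
// for submission immediately; whether their tasks actually overlap is then
// up to the JobTracker's scheduler.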
Thanks,
Kristi