I submitted a job in yarn-client mode that simply reads from an HBase
table containing tens of millions of records and then performs a *count* action.
The job runs much longer than I expected, so I wonder whether that is
because there is too much data to read. However, there are 20 nodes in
my Hadoop cluster, so a table of tens of millions of records does not
seem that big.

I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).

BTW, while the job is running I can see logs on the console, and
specifically I'd like to know what the following lines mean:

14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as 13454 bytes in 0 ms
14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426 ms on b04.jsepc.com (progress: 18/86)
14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
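
One possible reading of these lines, offered as an assumption rather than a confirmed answer: "task 0.0:20" looks like task set 0.0 (stage 0, attempt 0) with task index 20, "TID 20" is the task ID, PROCESS_LOCAL is the locality level of the task, and "progress: 18/86" says that 18 of the 86 tasks in the task set have finished. The small Python sketch below only pulls those fields out of two of the lines above (e-mail wrapping undone); the field names in the dictionaries are made up for illustration and are not Spark terminology.

```python
import re

# Two of the console lines quoted above, with the e-mail line wrapping undone.
LOG = [
    "14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 "
    "as TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)",
    "14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426 "
    "ms on b04.jsepc.com (progress: 18/86)",
]

# Assumed layout: "<stage>.<attempt>:<task index>", then task ID, executor, host, locality.
START = re.compile(
    r"Starting task (\d+)\.(\d+):(\d+) as TID (\d+) on executor (\S+): (\S+) \((\w+)\)"
)
# Assumed layout: task ID, run time in ms, host, finished tasks / total tasks.
FINISH = re.compile(
    r"Finished TID (\d+) in (\d+) ms on (\S+) \(progress: (\d+)/(\d+)\)"
)


def parse(line):
    """Return a dict of the fields in a TaskSetManager line, or None if unrecognized."""
    m = START.search(line)
    if m:
        return {
            "event": "start",
            "stage": int(m.group(1)),
            "attempt": int(m.group(2)),
            "index": int(m.group(3)),
            "tid": int(m.group(4)),
            "executor": m.group(5),
            "host": m.group(6),
            "locality": m.group(7),
        }
    m = FINISH.search(line)
    if m:
        return {
            "event": "finish",
            "tid": int(m.group(1)),
            "ms": int(m.group(2)),
            "host": m.group(3),
            "done": int(m.group(4)),
            "total": int(m.group(5)),
        }
    return None


events = [parse(line) for line in LOG]
for event in events:
    print(event)
```

Under that reading, TID 19 took about 16 seconds, and 68 of the 86 tasks were still outstanding when the line was printed.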


Thanks
