I submitted a job in yarn-client mode that simply reads an HBase table containing tens of millions of records and then performs a count action. The job runs for much longer than I expected, so I wonder whether the amount of data to read is the cause. However, my Hadoop cluster has 20 nodes, so the HBase table doesn't seem that big (tens of millions of records).
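For reference, the job is essentially the following (a minimal sketch; the table name is a placeholder, and I'm assuming the standard TableInputFormat route for Spark 0.9 / HBase 0.96):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    val sc = new SparkContext("yarn-client", "hbase-count")

    // HBase configuration; "my_table" is a placeholder for the real table name
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each HBase region becomes one Spark partition/task
    val rdd = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())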
I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96). BTW, while the job was running I could see logs on the console, and specifically I'd like to know what the following log lines mean:

    14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
    14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as 13454 bytes in 0 ms
    14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426 ms on b04.jsepc.com (progress: 18/86)
    14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)

Thanks