Hi, Tao,
When I used newAPIHadoopRDD (against Accumulo, not HBase), I found that I
had to specify executor-memory and num-executors explicitly on the command
line, or else I didn't get any parallelism across the cluster.

I used --executor-memory 3G --num-executors 24, but different values will
likely suit your cluster better.
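
For example, with a spark-submit-style launch (the class and jar names
below are placeholders, and on Spark 0.9's YARN client the equivalent
flags were spelled --num-workers / --worker-memory, if memory serves):

spark-submit --master yarn-client \
  --executor-memory 3G \
  --num-executors 24 \
  --class com.example.CountJob \
  count-job.jar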

-Russ

On Mon, Sep 29, 2014 at 7:43 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> Can you look at your HBase UI to check whether your job is just reading
> from a single region server?
>
> Best,
>
> --
> Nan Zhu
>
> On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:
>
> I submitted a job in yarn-client mode that simply reads from an HBase
> table containing tens of millions of records and then does a *count* action.
> The job runs for a much longer time than I expected, so I wonder whether
> that is because there is too much data to read. Actually, there are 20
> nodes in my Hadoop cluster, so an HBase table with tens of millions of
> records doesn't seem that big.
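>
> For context, the job boils down to something like the following sketch
> ("my_table" stands in for the real table name, and this assumes HBase's
> TableInputFormat is on the classpath):
>
> import org.apache.hadoop.hbase.HBaseConfiguration
> import org.apache.hadoop.hbase.client.Result
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>
> val hbaseConf = HBaseConfiguration.create()
> hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder name
>
> // newAPIHadoopRDD creates one Spark partition per HBase region of the
> // input table, so parallelism is bounded by how the table is split.
> val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
>   classOf[ImmutableBytesWritable], classOf[Result])
> println(hbaseRDD.count())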
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, while the job is running I can see logs on the console, and
> specifically I'd like to know what the following lines mean:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>
