Hi,

I have two small RDD, each has about 600 records. In my code, I did

val rdd1 = sc...cache()
val rdd2 = sc...cache()

val result = rdd1.cartesian(rdd2).*repartition*(num_cpu).map {case (a,b) =>
  some_expensive_job(a,b)
}

I ran my job in YARN cluster with "--master yarn-cluster", I have 6
executor, and each has a large memory volume.

However, I noticed my job is very slow. I went to the RM page, and found
there are two containers, one is the driver, one is the worker. I guess
this is correct?

I went to the worker's log, and monitor the log detail. My app print some
information, so I can use them to estimate the progress of the "map"
operation. Looking at the log, it feels like the jobs are done one by one
sequentially, rather than #cpu batch at a time.

I checked the worker node, and their CPU are all busy.



[image: --]
Xi Shen
[image: http://]about.me/davidshen
<http://about.me/davidshen?promo=email_sig>
  <http://about.me/davidshen>

Reply via email to