I modified the code from:
lines.map(parsePoint).persist(StorageLevel.MEMORY_ONLY)
to:
lines.map(parsePoint).repartition(64).persist(StorageLevel.MEMORY_ONLY)

Every stage now runs much faster, about 30 seconds (down from 3.5 minutes). However,
I noticed that the total number of tasks drops from 200 to 64 after the first stage.


But I don't know if this is reasonable.
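
For reference, the job now looks roughly like this. This is only a sketch: `sc`,
the HDFS path, and the body of `parsePoint` are simplified placeholders (the real
`parsePoint` builds sparse feature vectors):

import org.apache.spark.storage.StorageLevel

// Hypothetical stand-in for the real parser, which builds sparse feature vectors.
def parsePoint(line: String): Array[Double] = line.split(' ').map(_.toDouble)

val lines = sc.textFile("hdfs://.../training_data")   // placeholder path

val points = lines
  .map(parsePoint)
  .repartition(64)                     // shuffle everything into exactly 64 partitions
  .persist(StorageLevel.MEMORY_ONLY)

// Every stage that operates on points after the repartition launches one task per
// partition, i.e. 64 tasks. The 200 tasks in the first stage most likely correspond
// to the HDFS input splits of the original file.
println(points.partitions.length)      // prints 64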


On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng <men...@gmail.com> wrote:

> After you load the data in, call `.repartition(number of
> executors).cache()`. If the data is evenly distributed, it may be hard
> to guess the root cause. Do the two clusters have the same internode
> bandwidth? -Xiangrui
>
> On Tue, Jul 29, 2014 at 11:06 PM, Tan Tim <unname...@gmail.com> wrote:
> > input data is evenly distributed to the executors.
> > ----
> > The input data is on HDFS, not on the Spark cluster. How can I make the
> > data distributed to the executors?
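
As a follow-up to my own question above: `textFile` takes a `minPartitions`
argument, which controls how many input partitions are created from an HDFS
file. A rough sketch, with a placeholder path:

val lines = sc.textFile("hdfs://.../training_data", minPartitions = 64)

Asking for more partitions than there are HDFS blocks spreads the load over
more tasks, and therefore over more executors.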
> >
> >
> > On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> The weight vector is usually dense and if you have many partitions,
> >> the driver may slow down. You can also take a look at the driver
> >> memory inside the Executor tab in WebUI. Another setting to check is
> >> the HDFS block size and whether the input data is evenly distributed
> >> to the executors. Are the hardware specs the same for the two
> >> clusters? -Xiangrui
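
A rough back-of-envelope on the driver cost mentioned above: a dense vector of
1M+ doubles is about 8 MB, so if each partition sends its gradient/weight update
back densely, 200 partitions mean on the order of 1.6 GB per iteration at the
driver, versus roughly 0.5 GB with 64 partitions. That assumes the aggregated
vectors really do become dense.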
> >>
> >> On Tue, Jul 29, 2014 at 10:46 PM, Tan Tim <unname...@gmail.com> wrote:
> >> > The application is Logistic Regression (OWLQN); we developed a sparse
> >> > vector version. The feature dimension is 1M+, but the data is very
> >> > sparse. This application runs fine on another Spark cluster, where every
> >> > stage takes about 50 seconds and every executor shows high CPU usage.
> >> > The only difference is the OS: the faster cluster runs Ubuntu and the
> >> > slower one runs CentOS.
> >
> >
>
