It looks reasonable. You can also try treeAggregate (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89)
instead of a normal aggregate if the driver needs to collect a large weight
vector from each partition.
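
As a rough sketch of the call shape (assuming an RDD[LabeledPoint] named
`points`; the names and the simple per-feature sum are only illustrative, a
real gradient aggregation follows the same pattern):

import org.apache.spark.mllib.rdd.RDDFunctions._   // brings treeAggregate into scope on Spark 1.x
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def featureSums(points: RDD[LabeledPoint], dim: Int): Array[Double] =
  points.treeAggregate(new Array[Double](dim))(
    (acc, p) => {
      // fold each point into the partition-local accumulator
      val arr = p.features.toArray   // densified only to keep the sketch short
      var i = 0
      while (i < arr.length) { acc(i) += arr(i); i += 1 }
      acc
    },
    (a, b) => {
      // partial results are merged pairwise on the executors, so only the
      // final vector is sent back to the driver
      var i = 0
      while (i < b.length) { a(i) += b(i); i += 1 }
      a
    },
    depth = 2   // raise the depth if there are many partitions
  )

With a plain aggregate, every partition's result is shipped to the driver and
combined there, which is exactly what gets slow when the weight vector is
large.

-Xiangrui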


On Wed, Jul 30, 2014 at 1:16 AM, Tan Tim <unname...@gmail.com> wrote:

> I modified the code from:
> lines.map(parsePoint).persist(StorageLevel.MEMORY_ONLY)
> to
> lines.map(parsePoint).repartition(64).persist(StorageLevel.MEMORY_ONLY)
>
> Every stage now runs much faster, about 30 seconds (down from 3.5 minutes).
> But I found that the total number of tasks dropped from 200 to 64 after the
> first stage, just like this:
>
> [inline screenshot omitted]
>
> But I don't know if this is reasonable.
>
>
> On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> After you load the data in, call `.repartition(number of
>> executors).cache()`. If the data is evenly distributed, it may be hard
>> to guess the root cause. Do the two clusters have the same internode
>> bandwidth?
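>>
>> For example, something like this (assuming a SparkContext `sc` and your
>> `parsePoint`; the HDFS path and the executor count are placeholders):
>>
>>   import org.apache.spark.storage.StorageLevel
>>
>>   val numExecutors = 64   // match the number of executors actually allocated
>>   val lines = sc.textFile("hdfs:///path/to/training_data")
>>   val points = lines.map(parsePoint)
>>     .repartition(numExecutors)           // spread the data evenly across executors
>>     .persist(StorageLevel.MEMORY_ONLY)   // .cache() uses this same storage level
>>
>> -Xiangrui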
>>
>> On Tue, Jul 29, 2014 at 11:06 PM, Tan Tim <unname...@gmail.com> wrote:
>> > input data is evenly distributed to the executors.
>> > ----
>> > The input data is on HDFS, not on the Spark cluster. How can I make
>> > the data distributed to the executors?
>> >
>> >
>> > On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> The weight vector is usually dense and if you have many partitions,
>> >> the driver may slow down. You can also take a look at the driver
>> >> memory on the Executors tab in the web UI. Another setting to check is
>> >> the HDFS block size and whether the input data is evenly distributed
>> >> to the executors. Are the hardware specs the same for the two
>> >> clusters? -Xiangrui
>> >>
>> >> On Tue, Jul 29, 2014 at 10:46 PM, Tan Tim <unname...@gmail.com> wrote:
>> >> > The application is Logistic Regression (OWLQN); we developed a sparse
>> >> > vector version. The feature dimension is 1M+, but the data is very
>> >> > sparse. This application can run on another Spark cluster, where every
>> >> > stage takes about 50 seconds and every executor has high CPU usage. The
>> >> > only difference is the OS (the faster cluster runs Ubuntu and the
>> >> > slower one runs CentOS).
>> >
>> >
>>
>
>
