Thanks Yanbo!
I checked the Spark UI and found that Exp 1) has 52 jobs and 99
stages, while Exp 2) has 105 jobs and 206 stages. Each job takes
3-4s and each stage takes 1-2s, which is why Exp 2) takes about 2x
as long as Exp 1).
I also noticed that in Exp 2) the completed stage IDs aren't contiguous:
the last completed stage ID is 413, but the number of completed stages is 206.
There are no failed stages. Do you know the reason?
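For reference, the parsing step both experiments perform before training boils down to turning each LibSVM line ("label index:value index:value ...") into a label plus sparse index/value arrays, the shape MLlib's sparse vectors expect. A minimal sketch (plain Java, no Spark dependencies; the `parseLine` helper and class name are illustrative, not part of MLlib):

```java
import java.util.Arrays;

// Sketch of the per-line LibSVM parsing done before building an
// RDD<LabeledPoint>. Names are illustrative, not Spark API.
public class LibSvmParse {
    // Parses "label idx:val idx:val ..." into {label, indices, values}.
    static Object[] parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        double label = Double.parseDouble(parts[0]);
        int n = parts.length - 1;
        int[] indices = new int[n];
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            String[] kv = parts[i + 1].split(":");
            indices[i] = Integer.parseInt(kv[0]) - 1; // LibSVM indices are 1-based
            values[i] = Double.parseDouble(kv[1]);
        }
        return new Object[]{label, indices, values};
    }

    public static void main(String[] args) {
        Object[] p = parseLine("1 3:0.5 7:2.0");
        System.out.println(p[0]);                              // 1.0
        System.out.println(Arrays.toString((int[]) p[1]));     // [2, 6]
        System.out.println(Arrays.toString((double[]) p[2]));  // [0.5, 2.0]
    }
}
```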

Haoyue

2015-12-05 22:59 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:

> Hi Haoyue,
>
> Could you find the time spent on each stage of the LogisticRegression
> model training in the Spark UI?
> It can tell us which stage is the most time-consuming and help us
> analyze the cause.
>
> Yanbo
>
> 2015-12-05 15:14 GMT+08:00 Haoyue Wang <whymoon...@gmail.com>:
>
>> Hi all,
>> I'm doing some experiments with Spark MLlib (version 1.5.0). I train a
>> LogisticRegressionModel on a 2.06GB dataset (# of data points: 2396130, # of
>> features: 3231961, # of classes: 2, format: LibSVM). I deployed Spark on a
>> 4-node cluster; each node's spec: CPU: Intel(R) Xeon(R) E5-2650 0 @
>> 2.00GHz, 2 CPUs * 8 cores * 2 threads; Network: 40Gbps InfiniBand; RAM:
>> 256GB (Spark configuration: driver 100GB, executor 100GB).
>>
>> I'm doing two experiments:
>> 1) Load the data into Hive, use HiveContext in the Spark program to load it
>> from Hive into a DataFrame, parse the DataFrame into an
>> RDD<LabeledPoint>, then train the LogisticRegressionModel on this RDD.
>> The training time is 389218 milliseconds.
>> 2) Load the data from a socket server into an RDD, after performing some
>> feature transformations that add 5 features to each datum, so the # of
>> features is 3231966. Then repartition this RDD into 16 partitions, parse it
>> into an RDD<LabeledPoint>, and finally train the LogisticRegressionModel
>> on this RDD.
>> The training time is 838470 milliseconds.
>>
>> The training time mentioned above covers only the time of: final
>> LogisticRegressionModel model = new
>> LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());
>> it does not include the loading and parsing time.
>>
>> So here is the question: why do these two experiments' training times
>> differ so much? I expected them to be similar, but Exp 2) takes 2x as long.
>> I even tried repartitioning the RDD into 4/32/64/128 partitions and caching
>> them before training in experiment 2, but it made no difference.
>>
>> Is there any internal difference between the RDDs used for training in the
>> two experiments that causes the difference in training time?
>> I would appreciate any guidance you can give me.
>>
>> Best,
>> Haoyue
>>
>
>