The problem turned out to be a too-high 'spark.default.parallelism':
BinaryClassificationMetrics does a combineByKey, which internally shuffles the
training dataset. Lowering parallelism, plus cutting the training set's RDD
lineage with a save/read round-trip to Parquet, solved the problem. Thanks for
the hint!
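
For anyone hitting the same thing, a minimal sketch of what the fix boils down
to (Spark 1.5-era API; the value 800 and the scoreAndLabels name are
placeholders, not our actual code):

  // keep parallelism close to the total number of cores; set before the
  // context starts, e.g. via spark-submit: --conf spark.default.parallelism=800

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.rdd.RDD

  // BinaryClassificationMetrics runs a combineByKey over the (score, label)
  // pairs, so the shuffle width follows this RDD's partition count
  def auc(scoreAndLabels: RDD[(Double, Double)]): Double =
    new BinaryClassificationMetrics(scoreAndLabels.coalesce(800)).areaUnderROC()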

On Wed, Sep 23, 2015 at 11:10 PM, DB Tsai <dbt...@dbtsai.com> wrote:

> You want to reduce the # of partitions to around the # of executors *
> cores. Having so many tasks/partitions puts a lot of pressure on the
> treeReduce in LoR. Let me know if this helps.
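>
> As a rough sketch of that suggestion (trainDF is a placeholder for the
> cached training frame; 100 executors x 8 cores from the numbers below):
>
>   // ~ executors * cores; coalesce narrows partitions without a full
>   // shuffle, use repartition instead if the data needs rebalancing
>   val train = trainDF.coalesce(100 * 8).cache()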
>
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>
> On Wed, Sep 23, 2015 at 5:39 PM, Eugene Zhulenev <
> eugene.zhule...@gmail.com> wrote:
>
>> ~3000 features, pretty sparse, about 200-300 non-zero features in each
>> row. We have 100 executors x 8 cores. The number of tasks is pretty big,
>> 30k-70k (I can't remember the exact number). The training set is the result
>> of a pretty big join across multiple data frames, but it's cached. However,
>> as I understand it, Spark still keeps the RDD's DAG lineage so it can
>> recover it if one of the nodes fails.
>>
>> I'll try tomorrow to save the training set as Parquet, load it back as a
>> DataFrame, and run the modeling that way.
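>>
>> Roughly along these lines, I expect (the path and names are placeholders;
>> sqlContext is the usual Spark 1.5 SQLContext):
>>
>>   // writing the joined result out and reading it back yields a DataFrame
>>   // whose lineage starts at the Parquet files rather than the whole join DAG
>>   trainDF.write.parquet("/tmp/train-set.parquet")
>>   val train = sqlContext.read.parquet("/tmp/train-set.parquet")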
>>
>> On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> Your code looks correct to me. How many features do you have in this
>>> training set? How many tasks are running in the job?
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>
>>> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <
>>> eugene.zhule...@gmail.com> wrote:
>>>
>>>> It's really simple:
>>>> https://gist.github.com/ezhulenev/7777886517723ca4a353
>>>>
>>>> We've seen the same strange heap behavior even for a single model: it
>>>> takes ~20 GB of driver heap to build one model from an input data frame
>>>> with fewer than 1 million rows.
>>>>
>>>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>>> Could you paste some of your code for diagnosis?
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> ----------------------------------------------------------
>>>>> Blog: https://www.dbtsai.com
>>>>> PGP Key ID: 0xAF08DF8D
>>>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>>>
>>>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>>>>> eugene.zhule...@gmail.com> wrote:
>>>>>
>>>>>> We are running Apache Spark 1.5.0 (latest code from the 1.5 branch).
>>>>>>
>>>>>> We are running 2-3 LogisticRegression models in parallel (we'd love
>>>>>> to run 10-20, actually); they are not really big at all, maybe 1-2
>>>>>> million rows per model.
>>>>>>
>>>>>> The cluster itself and all executors look good: enough free memory and
>>>>>> no exceptions or errors.
>>>>>>
>>>>>> However, I see very strange behavior inside the Spark driver: the
>>>>>> allocated heap grows constantly, up to 30 GB in 1.5 hours, and then
>>>>>> everything becomes super slow.
>>>>>>
>>>>>> We don't do any collect, and I really don't understand what is
>>>>>> consuming all this memory. It looks like something inside
>>>>>> LogisticRegression itself; however, I only see treeAggregate, which
>>>>>> should not require that much memory to run.
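>>>>>>
>>>>>> As I understand it, the per-iteration aggregation has roughly this
>>>>>> shape (an illustrative sketch only, not Spark's actual code; points and
>>>>>> numFeatures are placeholders), with every partition contributing one
>>>>>> dense array of size numFeatures that gets merged toward the driver:
>>>>>>
>>>>>>   import org.apache.spark.mllib.regression.LabeledPoint
>>>>>>   import org.apache.spark.rdd.RDD
>>>>>>
>>>>>>   def aggregateGradients(points: RDD[LabeledPoint], numFeatures: Int): Array[Double] =
>>>>>>     points.treeAggregate(new Array[Double](numFeatures))(
>>>>>>       seqOp = (acc, p) => {
>>>>>>         // fold each point's features into the partition-local array
>>>>>>         val arr = p.features.toArray
>>>>>>         var i = 0
>>>>>>         while (i < arr.length) { acc(i) += arr(i); i += 1 }
>>>>>>         acc
>>>>>>       },
>>>>>>       combOp = (a, b) => {
>>>>>>         // partial results merged pairwise up a tree before the driver
>>>>>>         var i = 0
>>>>>>         while (i < a.length) { a(i) += b(i); i += 1 }
>>>>>>         a
>>>>>>       },
>>>>>>       depth = 2)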
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Also, I don't see any GC pauses; it looks like the memory is still
>>>>>> being held by something inside the driver.
>>>>>>
>>>>>> [attachments: Inline image 1, Inline image 2]
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
