The problem turned out to be a too-high 'spark.default.parallelism': BinaryClassificationMetrics does a combineByKey, which internally shuffles the training dataset. Lowering the parallelism + cutting the training set's RDD lineage by saving it to parquet and reading it back solved the problem. Thanks for the hint!
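For anyone hitting the same thing, here is a minimal sketch of the fix, assuming an SQLContext on the driver; the parquet path, the partition count (100 executors x 8 cores = 800), and variable names are illustrative, not taken from our actual job:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.linalg.Vector

    val sqlContext: SQLContext = ???               // existing context on the driver
    val joinedTrainDF: DataFrame = ???             // result of the big multi-frame join
    val checkpointPath = "/tmp/train-set.parquet"  // hypothetical location

    // Materialize the training set and read it back: this cuts the long RDD
    // lineage, so the shuffle done by combineByKey inside
    // BinaryClassificationMetrics no longer drags the whole join DAG along.
    joinedTrainDF.write.mode("overwrite").parquet(checkpointPath)

    // Keep the partition count close to executors * cores instead of relying
    // on a very high spark.default.parallelism.
    val train = sqlContext.read.parquet(checkpointPath).repartition(800).cache()

    val model = new LogisticRegression().setMaxIter(100).fit(train)

    // (score, label) pairs for the metrics; the combineByKey happens in here.
    val scoreAndLabels = model.transform(train)
      .select("probability", "label")
      .map(row => (row.getAs[Vector](0)(1), row.getDouble(1)))

    println(s"AUC = ${new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()}")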
On Wed, Sep 23, 2015 at 11:10 PM, DB Tsai <dbt...@dbtsai.com> wrote:

> You want to reduce the # of partitions to around the # of executors *
> cores. Since you have so many tasks/partitions, they will put a lot of
> pressure on treeReduce in LoR. Let me know if this helps.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>
> On Wed, Sep 23, 2015 at 5:39 PM, Eugene Zhulenev <eugene.zhule...@gmail.com> wrote:
>
>> ~3000 features, pretty sparse, I think about 200-300 non-zero features in
>> each row. We have 100 executors x 8 cores. The number of tasks is pretty
>> big, 30k-70k, I can't remember the exact number. The training set is the
>> result of a pretty big join from multiple data frames, but it's cached.
>> However, as I understand it, Spark still keeps the DAG history of the RDD
>> to be able to recover it in case one of the nodes fails.
>>
>> I'll try tomorrow to save the training set as parquet, load it back as a
>> DataFrame, and run modeling this way.
>>
>> On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> Your code looks correct to me. How many features do you have in this
>>> training set? How many tasks are running in the job?
>>>
>>> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <eugene.zhule...@gmail.com> wrote:
>>>
>>>> It's really simple:
>>>> https://gist.github.com/ezhulenev/7777886517723ca4a353
>>>>
>>>> We've seen the same strange heap behavior even for a single model: it
>>>> takes ~20 GB of heap on the driver to build a single model with fewer
>>>> than 1 million rows in the input data frame.
>>>>
>>>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>>> Could you paste some of your code for diagnosis?
>>>>>
>>>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <eugene.zhule...@gmail.com> wrote:
>>>>>
>>>>>> We are running Apache Spark 1.5.0 (latest code from the 1.5 branch).
>>>>>>
>>>>>> We are running 2-3 LogisticRegression models in parallel (we'd love
>>>>>> to run 10-20, actually). They are not really big at all, maybe 1-2
>>>>>> million rows in each model.
>>>>>>
>>>>>> The cluster itself and all executors look good: enough free memory
>>>>>> and no exceptions or errors.
>>>>>>
>>>>>> However, I see very strange behavior inside the Spark driver. The
>>>>>> allocated heap is constantly growing; it grows up to 30 GB in 1.5
>>>>>> hours, and then everything becomes super slow.
>>>>>>
>>>>>> We don't do any collect, and I really don't understand who is
>>>>>> consuming all this memory. It looks like something inside
>>>>>> LogisticRegression itself, however I only see treeAggregate, which
>>>>>> should not require so much memory to run.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Plus, I don't see any GC pauses; it looks like the memory is still
>>>>>> used by someone inside the driver.
>>>>>>
>>>>>> [image: Inline image 2]
>>>>>> [image: Inline image 1]
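P.S. The "several models in parallel" part of the thread is just a handful of driver-side threads each calling fit(); roughly the pattern below. This is not the actual gist code; the thread-pool size, regParam, and helper names are made up:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    // One driver-side thread per concurrent fit; the Spark jobs themselves
    // still run on the executors and share the cluster via the scheduler.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))

    def fitOne(train: DataFrame): LogisticRegressionModel =
      new LogisticRegression()
        .setMaxIter(100)
        .setRegParam(0.01)   // illustrative value
        .fit(train)

    // One cached (and lineage-truncated, see above) DataFrame per model.
    val trainSets: Seq[DataFrame] = ???

    val models: Seq[LogisticRegressionModel] =
      Await.result(Future.sequence(trainSets.map(df => Future(fitOne(df)))), Duration.Inf)

With the default FIFO scheduler the concurrent jobs still queue up stage by stage; setting spark.scheduler.mode=FAIR helps when running many fits at once.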