A single vector of size 10^7 won't hit that bound. How many clusters did you set? The broadcast variable size is roughly 10^7 * k, so you can calculate the amount of memory it needs. Try reducing the number of tasks and see whether it helps. -Xiangrui
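Xiangrui's size estimate can be checked with back-of-the-envelope arithmetic (a rough sketch assuming dense, double-precision 8-byte values; `broadcast_gb` is an illustrative helper, not a Spark API):

```python
# Rough arithmetic for the sizes discussed above. Assumes dense 8-byte
# doubles; broadcast_gb is an illustrative helper, not part of Spark.

dims = 10**7                        # vector dimension from the thread
bytes_per_double = 8
two_gb = 2 * 1024**3                # the 2GB array bound mentioned below

# A single 10^7-dimensional dense vector is only ~80 MB, far below 2GB:
single_vector_bytes = dims * bytes_per_double
print(single_vector_bytes)          # 80000000

# The k-means broadcast holds the k current centers, ~10^7 * k doubles:
def broadcast_gb(k):
    return k * dims * bytes_per_double / 1024**3

print(round(broadcast_gb(100), 2))  # ~7.45 GB for k = 100
print(single_vector_bytes < two_gb) # True
```

So a single vector is nowhere near the 2GB bound; the broadcast grows linearly with k, which is why the number of clusters matters here.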
On Tue, Feb 17, 2015 at 7:20 PM, lihu <lihu...@gmail.com> wrote:
> Thanks for your answer. Yes, I cached the data; I can observe from the
> WebUI that all the data is cached in memory.
>
> What worries me is the dimension, not the total size.
>
> Sean Owen once answered me that broadcast supports a maximum array size
> of 2GB, so isn't 10^7 a little large?
>
> On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Did you cache the data? Was it fully cached? The k-means
>> implementation doesn't create many temporary objects. I guess you need
>> more RAM to avoid GC being triggered frequently. Please monitor memory
>> usage with YourKit or VisualVM. -Xiangrui
>>
>> On Wed, Feb 11, 2015 at 1:35 AM, lihu <lihu...@gmail.com> wrote:
>> > I just want to make the best use of the CPU and test Spark's
>> > performance when there are many tasks on a single node.
>> >
>> > On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Good, worth double-checking that's what you got. That's barely 1GB
>> >> per task though. Why run 48 if you have 24 cores?
>> >>
>> >> On Wed, Feb 11, 2015 at 9:03 AM, lihu <lihu...@gmail.com> wrote:
>> >> > I gave 50GB to the executor, so it seems there is no reason the
>> >> > memory is not enough.
>> >> >
>> >> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> >>
>> >> >> Meaning, you have 128GB per machine, but how much memory are you
>> >> >> giving the executors?
>> >> >>
>> >> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu <lihu...@gmail.com> wrote:
>> >> >> > What do you mean? Yes, I can see from the web UI that some data
>> >> >> > is put in memory.
>> >> >> >
>> >> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> >> >>
>> >> >> >> Are you actually using that memory for executors?
>> >> >> >>
>> >> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu <lihu...@gmail.com> wrote:
>> >> >> >> > Hi,
>> >> >> >> >     I run k-means (MLlib) on a cluster with 12 workers. Every
>> >> >> >> > worker owns 128GB RAM and 24 cores. I run 48 tasks on one
>> >> >> >> > machine; the total data is just 40GB.
>> >> >> >> >
>> >> >> >> > When the dimension of the data set is about 10^7, every task
>> >> >> >> > takes about 30s, of which about 20s is spent in GC.
>> >> >> >> >
>> >> >> >> > When I reduce the dimension to 10^4, the GC time is small.
>> >> >> >> >
>> >> >> >> > So why is GC so high when the dimension is larger? Or is this
>> >> >> >> > caused by MLlib?
>> >> >> >
>> >> >> > --
>> >> >> > Best Wishes!
>> >> >> >
>> >> >> > Li Hu(李浒) | Graduate Student
>> >> >> > Institute for Interdisciplinary Information Sciences (IIIS)
>> >> >> > Tsinghua University, China
>> >> >> >
>> >> >> > Email: lihu...@gmail.com
>> >> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
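For context, the GC behaviour described in the original question is consistent with simple per-task heap arithmetic (a rough sketch; it assumes dense double vectors, using the figures quoted in the thread):

```python
# Per-task heap arithmetic for the setup in the thread: 50GB executor
# heap, 48 tasks per machine, dense 10^7-dimensional double vectors.

dims = 10**7
bytes_per_double = 8
vector_bytes = dims * bytes_per_double          # 80 MB per dense vector

executor_heap_bytes = 50 * 1024**3              # 50GB given to the executor
tasks = 48
per_task_bytes = executor_heap_bytes // tasks   # ~1.07 GB per task

# Only a handful of 80 MB vectors (plus k-means temporaries such as
# distance buffers) fit in one task's share of the heap, so allocating
# large objects churns the heap and triggers frequent GC at 10^7 dims.
vectors_per_task = per_task_bytes // vector_bytes

print(vector_bytes)       # 80000000
print(vectors_per_task)   # 13
```

At 10^4 dimensions the same dense vector is only ~80 KB, which matches the small GC times lihu observed at the lower dimension.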