A single vector of size 10^7 won't hit that bound. How many clusters did you set? The broadcast variable size is roughly 10^7 * k, so you can calculate the amount of memory it needs. Try reducing the number of tasks and see whether it helps. -Xiangrui
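Xiangrui's size estimate can be checked with back-of-the-envelope arithmetic (a rough sketch assuming dense, double-precision 8-byte values; `broadcast_gb` is an illustrative helper, not a Spark API):

```python
# Rough arithmetic for the sizes discussed above. Assumes dense 8-byte
# doubles; broadcast_gb is an illustrative helper, not part of Spark.

dims = 10**7                        # vector dimension from the thread
bytes_per_double = 8
two_gb = 2 * 1024**3                # the 2GB array bound mentioned below

# A single 10^7-dimensional dense vector is only ~80 MB, far below 2GB:
single_vector_bytes = dims * bytes_per_double
print(single_vector_bytes)          # 80000000

# The k-means broadcast holds the k current centers, ~10^7 * k doubles:
def broadcast_gb(k):
    return k * dims * bytes_per_double / 1024**3

print(round(broadcast_gb(100), 2))  # ~7.45 GB for k = 100
print(single_vector_bytes < two_gb) # True
```

So a single vector is nowhere near the 2GB bound; the broadcast grows linearly with k, which is why the number of clusters matters here.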
On Tue, Feb 17, 2015 at 7:20 PM, lihu <lihu...@gmail.com> wrote:
> Thanks for your answer. Yes, I cached the data; I can observe from the
> WebUI that all the data is cached in memory.
>
> What worries me is the dimension, not the total size.
>
> Sean Owen once answered me that broadcast supports a maximum array size
> of 2GB, so isn't 10^7 a little large?
>
> On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Did you cache the data? Was it fully cached? The k-means
>> implementation doesn't create many temporary objects. I guess you need
>> more RAM to avoid GC being triggered frequently. Please monitor memory
>> usage with YourKit or VisualVM. -Xiangrui
>>
>> On Wed, Feb 11, 2015 at 1:35 AM, lihu <lihu...@gmail.com> wrote:
>> > I just want to make the best use of the CPU and test Spark's
>> > performance when there are many tasks on a single node.
>> >
>> > On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Good, worth double-checking that's what you got. That's barely 1GB
>> >> per task though. Why run 48 if you have 24 cores?
>> >>
>> >> On Wed, Feb 11, 2015 at 9:03 AM, lihu <lihu...@gmail.com> wrote:
>> >> > I gave 50GB to the executor, so it seems there is no reason the
>> >> > memory is not enough.
>> >> >
>> >> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> >>
>> >> >> Meaning, you have 128GB per machine, but how much memory are you
>> >> >> giving the executors?
>> >> >>
>> >> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu <lihu...@gmail.com> wrote:
>> >> >> > What do you mean? Yes, I can see from the web UI that some data
>> >> >> > is put in memory.
>> >> >> >
>> >> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> >> >>
>> >> >> >> Are you actually using that memory for executors?
>> >> >> >>
>> >> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu <lihu...@gmail.com> wrote:
>> >> >> >> > Hi,
>> >> >> >> >     I run k-means (MLlib) on a cluster with 12 workers. Every
>> >> >> >> > worker owns 128GB RAM and 24 cores. I run 48 tasks on one
>> >> >> >> > machine; the total data is just 40GB.
>> >> >> >> >
>> >> >> >> > When the dimension of the data set is about 10^7, every task
>> >> >> >> > takes about 30s, of which about 20s is spent in GC.
>> >> >> >> >
>> >> >> >> > When I reduce the dimension to 10^4, the GC time is small.
>> >> >> >> >
>> >> >> >> > So why is GC so high when the dimension is larger? Or is this
>> >> >> >> > caused by MLlib?
>> >> >> >
>> >> >> > --
>> >> >> > Best Wishes!
>> >> >> >
>> >> >> > Li Hu(李浒) | Graduate Student
>> >> >> > Institute for Interdisciplinary Information Sciences (IIIS)
>> >> >> > Tsinghua University, China
>> >> >> >
>> >> >> > Email: lihu...@gmail.com
>> >> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
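For context, the GC behaviour described in the original question is consistent with simple per-task heap arithmetic (a rough sketch; it assumes dense double vectors, using the figures quoted in the thread):

```python
# Per-task heap arithmetic for the setup in the thread: 50GB executor
# heap, 48 tasks per machine, dense 10^7-dimensional double vectors.

dims = 10**7
bytes_per_double = 8
vector_bytes = dims * bytes_per_double          # 80 MB per dense vector

executor_heap_bytes = 50 * 1024**3              # 50GB given to the executor
tasks = 48
per_task_bytes = executor_heap_bytes // tasks   # ~1.07 GB per task

# Only a handful of 80 MB vectors (plus k-means temporaries such as
# distance buffers) fit in one task's share of the heap, so allocating
# large objects churns the heap and triggers frequent GC at 10^7 dims.
vectors_per_task = per_task_bytes // vector_bytes

print(vector_bytes)       # 80000000
print(vectors_per_task)   # 13
```

At 10^4 dimensions the same dense vector is only ~80 KB, which matches the small GC times lihu observed at the lower dimension.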