Thanks Zhan. I'm still confused by the jstack output, though: why does the driver get stuck at "org.apache.spark.SparkContext.clean"?
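Not an answer from the thread, but one plausible reading of the trace below: SparkContext.clean calls ClosureCleaner.ensureSerializable, which Java-serializes the entire closure just to verify it is serializable (that is exactly the ObjectOutputStream path visible in the stack). If the closure captures syn0Global and syn1Global, every mapPartitionsWithIndex call inside the fit loop re-serializes vocabSize * vectorSize floats on the driver, which could easily pin it at 100% CPU. A rough back-of-the-envelope sketch; the sizes below are hypothetical, since the thread does not give real values:

```java
public class ClosureCostSketch {
    public static void main(String[] args) {
        // Hypothetical model dimensions (assumptions, not from the thread):
        long vocabSize = 10_000_000L;   // 10M distinct words
        long vectorSize = 100L;         // a typical Word2Vec vector size
        long floatBytes = 4L;           // size of a Java float

        // syn0Global and syn1Global each hold vocabSize * vectorSize floats,
        // so a test serialization of a closure capturing both walks roughly
        // 2x this many bytes per iteration of the fit loop.
        long bytesPerArray = vocabSize * vectorSize * floatBytes;
        System.out.println(bytesPerArray / (1024 * 1024) + " MiB per array");
    }
}
```

With these assumed sizes each array is close to 4 GB, serialized again on every iteration, so the driver spending all its time in ObjectOutputStream.writeFloats would be consistent with the trace.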
On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> I think it is an overflow. The training data is quite big, and the
> algorithm's scalability depends heavily on the vocabSize. Even without
> the overflow there are still other bottlenecks: for example, syn0Global
> and syn1Global each have vocabSize * vectorSize elements.
>
> Thanks.
>
> Zhan Zhang
>
> On Jan 5, 2015, at 7:47 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>
> Hi Xiangrui,
>
> Our dataset is about 80 GB (10B lines).
>
> In the driver's log, we found this:
>
> INFO Word2Vec: trainWordsCount = -1610413239
>
> It seems that there is an integer overflow?
>
> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>>
>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When we run MLlib word2vec (spark-1.1.0), the driver gets stuck at
>>> 100% CPU usage. Here is the jstack output:
>>>
>>> "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable [0x000000004162e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>      at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>>>      at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>>>      at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>>>      at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>>>      at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>>>      at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>>>      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>>>      at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>>      at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>>      at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>>      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>>      at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>>      at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>>      at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>>      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>>      at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>>      at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>>      at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>>      at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>>      at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>>>      at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>>>      at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>>>      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>>>      at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>>      at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>>>      at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>>>      at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>>>      at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>>      at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>>>      at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>>>      at com.baidu.inf.WordCount.main(WordCount.scala)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>>>      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> --
>>> Best Regards
>
> --
> Best Regards

--
Best Regards
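An addendum on the overflow theory: the negative trainWordsCount in the log is exactly what a 32-bit signed counter produces once the true total crosses Integer.MAX_VALUE. A minimal sketch; the corpus size below is hypothetical, chosen only because it is the smallest total that wraps to the logged value:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // Judging from the log line "trainWordsCount = -1610413239",
        // the word counter is accumulated as an int. Once the true total
        // exceeds Integer.MAX_VALUE (2,147,483,647) it wraps negative.
        long actualWords = 2_684_554_057L;  // hypothetical corpus total
        int wrapped = (int) actualWords;    // simulates int accumulation
        System.out.println(wrapped);        // prints -1610413239, the logged value

        // Accumulating into a long avoids the wraparound entirely:
        long safe = actualWords;
        System.out.println(safe);           // prints 2684554057
    }
}
```

With a 10B-line corpus the counter could even wrap more than once; either way, any total above ~2.1 billion words overflows an int, so widening the counter to a long is the natural fix.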