Oops, just kidding, this method is not in the current release. However, it is included in the latest commit on git if you want to do a build.
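In the meantime, here is a rough standalone sketch of what setMinCount is meant to do: prune the vocabulary to words occurring at least minCount times before training. This is plain Scala illustrating the filtering effect, not the MLlib internals, and the minCount of 5 is just the common word2vec default:

```scala
// Illustrative vocabulary pruning: keep only words that occur
// at least minCount times, which is the effect setMinCount has.
def pruneVocab(counts: Map[String, Long], minCount: Long): Map[String, Long] =
  counts.filter { case (_, n) => n >= minCount }

// Misspellings and one-off URLs typically have tiny counts and get dropped.
val counts = Map("spark" -> 120L, "teh" -> 2L, "word2vec" -> 40L, "http://x" -> 1L)
println(pruneVocab(counts, minCount = 5L).keySet) // drops "teh" and the URL
```

In MLlib itself this would be a call like `new Word2Vec().setMinCount(5)`, once you are on a build that includes the method.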
> On Jan 6, 2015, at 2:56 PM, Ganon Pierce <ganon.pie...@me.com> wrote:
>
> Two billion words is a very large vocabulary. You can try solving this issue by setting the number of times words must occur in order to be included in the vocabulary using setMinCount; this will prevent common misspellings, websites, and other noise from being included, and may improve the quality of your model overall.
>
>> On Jan 6, 2015, at 12:59 AM, Eric Zhen <zhpeng...@gmail.com> wrote:
>>
>> Thanks Zhan. I'm also confused about the jstack output: why does the driver get stuck at "org.apache.spark.SparkContext.clean"?
>>
>> On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>> I think it is overflow. The training data is quite big. The algorithm's scalability depends heavily on vocabSize. Even without overflow, there are still other bottlenecks; for example, syn0Global and syn1Global each have vocabSize * vectorSize elements.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Jan 5, 2015, at 7:47 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>>
>>> Hi Xiangrui,
>>>
>>> Our dataset is about 80 GB (10B lines).
>>>
>>> In the driver's log, we found this:
>>>
>>>   INFO Word2Vec: trainWordsCount = -1610413239
>>>
>>> It seems that there is an integer overflow?
>>>
>>> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>>>
>>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > When we run MLlib Word2Vec (spark-1.1.0), the driver gets stuck with 100% CPU usage.
>>> > Here is the jstack output:
>>> >
>>> > "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable [0x000000004162e000]
>>> >    java.lang.Thread.State: RUNNABLE
>>> >     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>>> >     at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>>> >     at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>>> >     at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>>> >     at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>>> >     at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>>> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>>> >     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>>> >     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>>> >     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>>> >     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>> >     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>>> >     at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>>> >     at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>>> >     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> >     at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>>> >     at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>>> >     at com.baidu.inf.WordCount.main(WordCount.scala)
>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >     at java.lang.reflect.Method.invoke(Method.java:597)
>>> >     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>>> >     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>> >     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> >
>>> > --
>>> > Best Regards
>>>
>>> --
>>> Best Regards
>>
>> --
>> Best Regards
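Two of the numbers in the thread above can be sanity-checked directly. The negative trainWordsCount is what a 32-bit Int shows after wrapping past 2^31 - 1 (the value below is one true count consistent with the logged value, not necessarily the actual one), and the syn0Global/syn1Global sizes give a rough per-array memory bound. The vocabulary and vector sizes here are illustrative assumptions, not figures from Eric's job:

```scala
// If trainWordsCount is accumulated in an Int (as the negative log value
// suggests), a true count of 2,684,554,057 words truncates to exactly the
// value seen in Eric's driver log.
val trueCount: Long = 2684554057L
println(trueCount.toInt) // -1610413239

// Each of syn0Global and syn1Global holds vocabSize * vectorSize floats.
val vocabSize   = 2000000000L // "two billion words", as mentioned above
val vectorSize  = 100L        // hypothetical vector size
val gibPerArray = vocabSize * vectorSize * 4L / (1L << 30)
println(s"~$gibPerArray GiB per array") // ~745 GiB, far beyond any driver heap
```

This is why pruning the vocabulary (e.g. via a min-count threshold) is the first lever to pull: both arrays scale linearly with vocabSize.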