Oops, just kidding, this method is not in the current release. However, it is 
included in the latest commit on git if you want to do a build.


> On Jan 6, 2015, at 2:56 PM, Ganon Pierce <ganon.pie...@me.com> wrote:
> 
> Two billion words is a very large vocabulary… You can try solving this issue 
> by setting, via setMinCount, the number of times a word must occur in order 
> to be included in the vocabulary. This will prevent common misspellings, 
> websites, and other noise from being included, and may improve the quality of 
> your model overall.
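A minimal sketch of this suggestion, assuming a Spark build recent enough to include setMinCount (per the note at the top of the thread, it is not in the 1.1.0 release). The input path and parameter values here are illustrative, not from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

// Hypothetical setup: path and parameter values are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("w2v"))
val corpus = sc.textFile("hdfs:///corpus.txt").map(_.split(" ").toSeq)

val model = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(5)   // drop words occurring fewer than 5 times
  .fit(corpus)
```

Raising minCount shrinks vocabSize directly, which also shrinks the syn0Global/syn1Global arrays discussed later in the thread.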
> 
>  
>> On Jan 6, 2015, at 12:59 AM, Eric Zhen <zhpeng...@gmail.com> wrote:
>> 
>> Thanks Zhan. I'm also confused by the jstack output: why does the driver get 
>> stuck at "org.apache.spark.SparkContext.clean"?
>> 
>> On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>> I think it is overflow. The training data is quite big, and the algorithm's 
>> scalability depends heavily on the vocabSize. Even without overflow, there 
>> are still other bottlenecks, for example syn0Global and syn1Global, each of 
>> which has vocabSize * vectorSize elements.
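To put rough numbers on this bottleneck (the sizes below are illustrative assumptions, not figures from the thread): with 10M distinct words after min-count pruning and 100-dimensional vectors, each of the two arrays holds a billion floats. JVM arrays are also Int-indexed, so vocabSize * vectorSize must stay below Int.MaxValue regardless of available memory:

```scala
// Illustrative sizes, not figures from the thread.
val vocabSize  = 10000000L              // 10M words after min-count pruning
val vectorSize = 100L
val elems = vocabSize * vectorSize      // elements per array: 1,000,000,000
val bytes = elems * 4L                  // a Float is 4 bytes: 4,000,000,000
println(s"${bytes / (1L << 30)} GiB per array")
assert(elems < Int.MaxValue)            // JVM arrays are Int-indexed
```

At these sizes each array is close to 4 GB, and both live on the driver, which is why trimming the vocabulary helps so much.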
>> 
>> Thanks.
>> 
>> Zhan Zhang
>> 
>> 
>> 
>> On Jan 5, 2015, at 7:47 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>> 
>>> Hi Xiangrui,
>>> 
>>> Our dataset is about 80 GB (10B lines). 
>>> 
>>> In the driver's log, we found this:
>>> 
>>> INFO Word2Vec: trainWordsCount = -1610413239
>>> 
>>> It seems that there is an integer overflow?
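The logged value is indeed consistent with a 32-bit counter wrapping around: adding 2^32 back to it recovers a plausible true count. (Treating trainWordsCount as an Int is an assumption about the 1.1.0 code, but the arithmetic checks out.)

```scala
// If a 32-bit counter wrapped, the true count is the logged value + 2^32.
val logged = -1610413239
val recovered = logged.toLong + (1L << 32)
println(recovered)                  // 2684554057, i.e. ~2.7B training words
assert(recovered.toInt == logged)   // wrapping back reproduces the log line
```

A recovered count of roughly 2.7B also lines up with the "two billion words" figure mentioned elsewhere in the thread.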
>>> 
>>> 
>>> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>>> 
>>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > When we run MLlib Word2Vec (spark-1.1.0), the driver gets stuck with 100%
>>> > CPU usage. Here is the jstack output:
>>> >
>>> > "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable
>>> > [0x000000004162e000]
>>> >    java.lang.Thread.State: RUNNABLE
>>> >         at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>>> >         at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>>> >         at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>>> >         at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>>> >         at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>>> >         at
>>> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>>> >         at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>>> >         at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >         at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >         at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >         at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >         at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >         at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >         at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >         at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >         at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> >         at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> >         at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> >         at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> >         at
>>> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>>> >         at
>>> > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>>> >         at
>>> > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>>> >         at
>>> > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>>> >         at
>>> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>> >         at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>>> >         at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>>> >         at
>>> > org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>>> >         at 
>>> > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> >         at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>>> >         at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>>> >         at com.baidu.inf.WordCount.main(WordCount.scala)
>>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >         at
>>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> >         at
>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>>> >         at
>>> > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>>> >         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>> >         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> >
>>> > --
>>> > Best Regards
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards
> 
