The word vectors are stored on the driver node, so a very large
vocabulary is hard to handle. You can use setMinCount to filter out
infrequent words and reduce the model size. -Xiangrui
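A sketch of what that might look like (the cutoff and vector size below are illustrative values, not recommendations, and `tokenized` is an assumed pre-built RDD of tokenized sentences):

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.rdd.RDD

// Train a smaller Word2Vec model by pruning rare words.
// The fitted model holds roughly vocabSize * vectorSize * 4 bytes
// (Float vectors) on the driver, so shrinking the vocabulary and/or
// the vector size directly shrinks driver memory use.
def trainSmallerModel(tokenized: RDD[Seq[String]]): Word2VecModel = {
  val word2vec = new Word2Vec()
    .setMinCount(25)    // drop words seen fewer than 25 times (illustrative)
    .setVectorSize(100) // smaller vectors also reduce the driver-side model
  word2vec.fit(tokenized)
}
```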

On Wed, Apr 22, 2015 at 12:32 AM, gm yu <husty...@gmail.com> wrote:
> When using MLlib's Word2Vec, I get the following error:
>
> allocating large array--thread_id[0x00007ff2741ca000]--thread_name[Driver]--array_size[1146093680 bytes]--array_length[1146093656 elements]
> prio=10 tid=0x00007ff2741ca000 nid=0x1405f runnable
>       at java.util.Arrays.copyOf(Arrays.java:2786)
>       at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>       - locked <0x00007ff33f7fafd0> (a java.io.ByteArrayOutputStream)
>       at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1812)
>       at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1504)
>       at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>       at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>       at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>       at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1346)
>       at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>       at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>       at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>       at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>       at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>       at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>       at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>       at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>       at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>       at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>       at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>       at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>       at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>       at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>       at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>       at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
>       at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>       at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>       at org.apache.spark.SparkContext.clean(SparkContext.scala:1627)
>       at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
>       at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:270)
>       at com.taobao.changrui.SynonymFind$.main(SynonymFind.scala:79)
>       at com.taobao.changrui.SynonymFind.main(SynonymFind.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:516)
>
>
> The data size is: 100M+ sentences, 100M+ words
>
> Job setting is: 50 executors with 20GB memory and 4 cores each; the driver memory is 30GB
>
>
> Any ideas? Thank you.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
