Hi,

I'm using a cluster with 5 nodes, each with 8 cores and 10GB of RAM.
Basically I'm building a dictionary from text, i.e. giving each word that
occurs more than n times across all texts a unique identifier.


The essential part of the code looks like this:

val texts = ctx.sql("SELECT text FROM table LIMIT 15000000")
  .map(_.head.toString)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)

var dict2 = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80)
dict2 = dict2.filter(s => !s.startsWith("http"))
dict2 = dict2.filter(s => !s.startsWith("@"))
dict2 = dict2.map(removePunctuation(_))  // removes .,?!:; from the single words
dict2 = dict2.groupBy(identity).filter(_._2.size > 10).keys  // only keep words that occur more than n times
val dict3 = dict2.zipWithIndex           // word -> unique id
val dictM = dict3.collect.toMap          // pull the whole dictionary to the driver

val count = dictM.size
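
(removePunctuation isn't shown above; as a rough sketch, not the exact
implementation, it just strips the characters .,?!:; from a single word:)

// rough sketch of removePunctuation: drop .,?!:; from a single word
def removePunctuation(word: String): String =
  word.filterNot(c => ".,?!:;".contains(c))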


If I use only 10M texts, it works. With 15M texts as above, I get the
error below. It surfaces at the dictM.size step at the end, but due to
laziness no computation happens before that point, so the whole pipeline
runs there.

14/08/27 22:36:29 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/08/27 22:36:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar, PROCESS_LOCAL, 921 bytes)
14/08/27 22:36:29 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on idp11.foo.bar:36295 (size: 9.4 KB, free: 10.4 GB)
14/08/27 22:36:30 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 2 to sp...@idp11.foo.bar:33925
14/08/27 22:36:30 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 1263 bytes
14/08/27 22:37:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        java.util.Arrays.copyOf(Arrays.java:3230)
        java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        ...



I'm fine with spilling to disk if my program runs out of memory, but is
there anything I can do to prevent this error without changing the Java
memory settings? (Assume those are already at the physical maximum.)


Kind regards,
Simon


