Hi, I'm using a cluster with 5 nodes, each with 8 cores and 10 GB of RAM. Basically, I'm creating a dictionary from text, i.e. giving each word that occurs more than n times across all texts a unique identifier.
The essential part of the code looks like this:

    var texts = ctx.sql("SELECT text FROM table LIMIT 15000000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
    var dict2 = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80)
    dict2 = dict2.filter(s => s.startsWith("http") == false)
    dict2 = dict2.filter(s => s.startsWith("@") == false)
    dict2 = dict2.map(removePunctuation(_)) // removes .,?!:; in strings (single words)
    dict2 = dict2.groupBy(identity).filter(_._2.size > 10).keys // only keep entries that occur more than n times
    var dict3 = dict2.zipWithIndex
    var dictM = dict3.collect.toMap
    var count = dictM.size

If I use only 10M texts, it works. With 15M texts as above, I get the error below. It occurs at the dictM.size operation, but due to laziness none of the computation actually happens before that point.

    14/08/27 22:36:29 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
    14/08/27 22:36:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar, PROCESS_LOCAL, 921 bytes)
    14/08/27 22:36:29 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on idp11.foo.bar:36295 (size: 9.4 KB, free: 10.4 GB)
    14/08/27 22:36:30 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 2 to sp...@idp11.foo.bar:33925
    14/08/27 22:36:30 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 1263 bytes
    14/08/27 22:37:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
            java.util.Arrays.copyOf(Arrays.java:3230)
            java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
            java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
            java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
            ...

I'm fine with spilling to disk if my program runs out of memory, but is there anything that would prevent this error without changing the Java memory settings? (Assume those are already at the physical maximum.)

Kind regards,
Simon
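P.S. For completeness, a minimal sketch of removePunctuation, assuming it does nothing more than what the comment above says (strip .,?!:; from a single word):

    def removePunctuation(s: String): String =
      s.filterNot(c => ".,?!:;".contains(c))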