Your dataset is small, so NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Also, since the dataset is small, could you save the vectorized data (the RDD[LabeledPoint]) and send me a sample? I'd like to take a look at the feature dimension. -Xiangrui
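P.S. In case it helps, here is a rough sketch of what I have in mind. The app name, output path, and the two toy records are only placeholders; swap in your own RDD[LabeledPoint] where the stand-in data is built.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object NBDebug {
  def main(args: Array[String]): Unit = {
    // Plain local mode with default memory settings -- no extra Spark configuration.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("nb-debug"))

    // Stand-in for the real vectorized data; plug in your own RDD[LabeledPoint] here.
    val vectorized: RDD[LabeledPoint] = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))))

    // The feature dimension is just the vector size of any record.
    println(s"feature dimension = ${vectorized.first().features.size}")

    // Save a small sample in LibSVM format so it can be shared and inspected.
    MLUtils.saveAsLibSVMFile(
      vectorized.sample(withReplacement = false, fraction = 0.01, seed = 42L), "nb-sample")

    // Train with the default smoothing parameter.
    val model = NaiveBayes.train(vectorized, lambda = 1.0)

    sc.stop()
  }
}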
On Tue, Sep 23, 2014 at 3:22 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> I get the following stacktrace if it is of any help.
>
> 14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
> 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List()
> 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91), which is now runnable
> 14/09/23 15:46:02 INFO executor.Executor: Finished task ID 7
> 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 7 (MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91)
> 14/09/23 15:46:02 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
> 14/09/23 15:46:02 INFO scheduler.TaskSetManager: Starting task 7.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL)
> 14/09/23 15:46:02 INFO scheduler.TaskSetManager: Serialized task 7.0:0 as 535061 bytes in 1 ms
> 14/09/23 15:46:02 INFO executor.Executor: Running task ID 8
> 14/09/23 15:46:02 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
> 14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 1 ms
> 14/09/23 15:46:04 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 452 MB to disk (1 time so far)
> 14/09/23 15:46:07 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 452 MB to disk (2 times so far)
> 14/09/23 15:46:09 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 438 MB to disk (3 times so far)
> 14/09/23 15:46:12 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 479 MB to disk (4 times so far)
> 14/09/23 15:46:22 ERROR executor.Executor: Exception in task ID 8
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3236)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 14/09/23 15:46:22 WARN scheduler.TaskSetManager: Lost TID 8 (task 7.0:0)
> 14/09/23 15:46:22 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3236)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 14/09/23 15:46:22 WARN scheduler.TaskSetManager: Loss was due to java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3236)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 14/09/23 15:46:22 ERROR scheduler.TaskSetManager: Task 7.0:0 failed 1 times; aborting job
> 14/09/23 15:46:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
> 14/09/23 15:46:22 INFO scheduler.TaskSchedulerImpl: Cancelling stage 7
> 14/09/23 15:46:22 INFO scheduler.DAGScheduler: Failed to run collect at NaiveBayes.scala:96
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7.0:0 failed 1 times, most recent failure: Exception failure in TID 8 on host localhost: java.lang.OutOfMemoryError: Java heap space
>         java.util.Arrays.copyOf(Arrays.java:3236)
>         java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>         java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>         java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>         java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> -----
> Novice Big Data Programmer
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p14880.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.