Re: Out of memory exception in MLlib's naive baye's classification training

2015-08-20 Thread minerva
Hello, I used Mahout for text classification and am now trying Spark. I had the same problem training Naive Bayes with (only) 569 documents. I solved it by using htf = HashingTF(5000) instead of htf = HashingTF() (default feature space 2^20). I don't know if it can be considered a long-term solution (w
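
To illustrate why shrinking the HashingTF feature space helps, here is a minimal pure-Python sketch of the hashing trick. It is not Spark's implementation: the function name `hashing_tf` and the use of `zlib.crc32` as the hash are assumptions made for illustration only. The point is that every token index falls in `[0, num_features)`, so the vector dimension (and hence the model width) is fixed by that one parameter.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1 << 20):
    """Map tokens to a sparse {index: count} dict via the hashing trick.

    Hypothetical helper for illustration; crc32 is a stand-in hash,
    not the hash function Spark's HashingTF uses.
    """
    counts = Counter()
    for tok in tokens:
        # Each token lands in one of num_features buckets.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


doc = "spark naive bayes runs out of memory on spark".split()
small = hashing_tf(doc, num_features=5000)   # dimension 5000
large = hashing_tf(doc)                      # dimension 2^20 (default)
```

A smaller feature space raises the chance that two different terms collide in the same bucket, which is the accuracy trade-off mentioned elsewhere in this thread.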

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-25 Thread Xiangrui Meng
For the vectorizer, what's the output feature dimension and are you creating sparse vectors or dense vectors? The model on the driver consists of numClasses * numFeatures doubles. However, the driver needs more memory in order to receive the task result (of the same size) from executors. So you nee
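
The sizing rule above (numClasses * numFeatures doubles) can be turned into a rough back-of-the-envelope estimate. This is only a sketch: the helper names are hypothetical, 8 bytes per double ignores JVM object overhead, and the factor of two (model plus one task result of the same size) is an assumption about the minimum, not a measured figure. The class counts in the usage lines are illustrative values, not numbers reported in this thread.

```python
def nb_model_bytes(num_classes, num_features, bytes_per_double=8):
    """Raw size of a numClasses x numFeatures matrix of doubles."""
    return num_classes * num_features * bytes_per_double


def driver_bytes_lower_bound(num_classes, num_features):
    """Assumed lower bound: the model plus one same-sized task result."""
    return 2 * nb_model_bytes(num_classes, num_features)


MIB = 1024 ** 2

# Illustrative class counts (assumptions) with feature sizes from the thread.
for classes, features in [(20, 43839), (20, 2 ** 20), (10000, 43839)]:
    size = nb_model_bytes(classes, features)
    print(f"{classes} classes x {features} features: {size / MIB:.1f} MiB")
```

The estimate grows linearly in both factors, so a large number of distinct labels inflates the model just as much as a large feature dimension does.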

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-24 Thread jatinpreet
Hi, I was able to get the training running in local mode with default settings. There was a problem with the document labels, which were quite large (not 20 as suggested earlier). I am currently training 175000 documents on a single node with 2GB of executor memory and 5GB of driver memory successfull

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, thanks for replying. I am using a subset of the newsgroup20 data. I will send you the vectorized data for analysis shortly. I have tried running in local mode as well, but I get the same OOM exception. I started with 4GB of data but then moved to a smaller set to verify that everything was

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread Xiangrui Meng
Your dataset is small. NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Since your dataset is small, could you save the vectorized data (RDD[LabeledPoint]) and send me a sample? I want to take a look at the fea

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
I get the following stack trace, if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at combineByK

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, yes, the total number of terms is 43839. I have also tried running it with different values of parallelism, ranging from 1/core to 10/core. I also tried multiple configurations, such as setting spark.storage.memoryFraction and spark.shuffle.memoryFraction to their default values. The point to note

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-22 Thread Xiangrui Meng
d that is unnerving for me. I can't use Hashing TF available with Spark due to the resultant decrease in accuracy, but with this feature size, I expect Spark to run easily. Thanks, Jatin, Novice Big Data Programmer

Out of memory exception in MLlib's naive baye's classification training

2014-09-22 Thread jatinpreet
Hi, I have been facing an unusual issue with Naive Bayes training. I run out of heap space even with limited data during the training phase. I am trying to run the same on a rudimentary cluster of two development machines in standalone mode. I am reading data from an HBase table, converting them in