Hello, I used Mahout for text classification and now I'm trying Spark.
I had the same problem training Naive Bayes with (only) 569 documents. I solved it by using htf = HashingTF(5000) instead of htf = HashingTF() (the default feature space is 2^20). I don't know whether that counts as a long-term solution (what will happen when I train on many more documents?), but I have two bigger issues at the moment.

My first issue is creating the LabeledPoints for the Bayes model. The TF-IDF transformation returns an RDD of sparse vectors, and I saved my labels (categories) in a separate RDD. I still haven't found a good way to combine the two while creating the LabeledPoints. My current solution needs a lot of collects (one per document). Each collect takes about 4 seconds (running on a VM with 16 GB RAM and 8 cores), which adds up to roughly 40 minutes just to build the LabeledPoints after the TF-IDF calculation.

My second issue is that keeping labels and features in separate RDDs and combining them later might cause problems once I run on more nodes (I'm currently on a single node), because I can't be sure that the order of the labels I saved will match the order of the feature vectors in the sparse-vector RDD... or can I?

Is there a post or "best practice" I can read to solve these two issues? Thanks a lot!
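For reference, this is a rough sketch of the pipeline I'm considering instead of the per-document collects (PySpark MLlib; the RDD name `data` and the sample documents are just placeholders for however the labeled texts are actually loaded). The part I'm unsure about is whether zip() is safe for pairing labels with vectors:

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import NaiveBayes

    sc = SparkContext(appName="NaiveBayesTFIDF")

    # Placeholder input: an RDD of (label, text) pairs.
    data = sc.parallelize([
        (0.0, "spark mllib naive bayes example"),
        (1.0, "another document about something else"),
    ])

    htf = HashingTF(numFeatures=5000)   # reduced feature space, as above

    labels = data.map(lambda lt: lt[0])                        # RDD of labels
    tf = htf.transform(data.map(lambda lt: lt[1].split(" ")))  # RDD of TF vectors
    tf.cache()

    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)           # RDD of TF-IDF sparse vectors

    # zip() pairs the i-th label with the i-th vector without any collect().
    # Both RDDs come from the same parent via one-to-one transformations
    # (map / transform), so partitioning and element order should line up.
    training = labels.zip(tfidf).map(lambda lv: LabeledPoint(lv[0], lv[1]))

    model = NaiveBayes.train(training)

My understanding is that zip() works per partition, so it should behave the same on a multi-node cluster as on a single node as long as both RDDs are derived one-to-one from the same parent, but I'd appreciate confirmation that this is the recommended way rather than carrying the label through the pipeline some other way.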