Hello,
I used Mahout for text classification, and now I'm trying it with Spark.

I had the same problem training Naive Bayes with (only) 569 documents.

I solved it by using htf = HashingTF(5000) instead of htf = HashingTF()
(the default feature space is 2^20). I don't know whether this can be
considered a long-term solution (what will happen when training with much,
much more documents?), but I have two bigger issues at the moment.
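
For reference, the TF-IDF part of my pipeline looks roughly like this (a
simplified PySpark sketch; the corpus and the variable names are just
placeholders):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="TfIdfSketch")

# Placeholder corpus: each document is already tokenized.
documents = sc.parallelize([
    ["spark", "text", "classification"],
    ["mahout", "naive", "bayes"],
])

# 5000 hash buckets instead of the 2^20 default; this is what made
# the out-of-memory error go away for me.
htf = HashingTF(numFeatures=5000)
tf = htf.transform(documents)

tf.cache()                 # IDF makes two passes over the data (fit + transform)
idf = IDF().fit(tf)
tfidf = idf.transform(tf)  # RDD of SparseVector, one per document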

My first issue at the moment is the creation of the LabeledPoints for the
Bayes model.
The TF-IDF transformation gives back an RDD of sparse vectors, and I saved
my labels (categories) in another RDD.

I still haven't found a good solution for combining the two pieces of
information when creating the LabeledPoints.
My current solution costs a lot of collects (one per document). Each collect
takes 4 seconds (running on a VM with 16 GB RAM, 8 cores), which adds up to
roughly 40 minutes just to create the LabeledPoints after the TF-IDF
calculation.
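
To make the problem concrete, what I'm doing now amounts to something like
the following (a simplified sketch, not my exact code; labels_list holds the
already-collected labels):

from pyspark.mllib.regression import LabeledPoint

# tfidf: RDD of SparseVector. One Spark job, including a collect,
# is triggered for every single document.
indexed = tfidf.zipWithIndex()   # (vector, index) pairs
points = []
for i, label in enumerate(labels_list):
    vec = indexed.filter(lambda x, i=i: x[1] == i) \
                 .map(lambda x: x[0]) \
                 .collect()[0]   # full pass over the RDD per document
    points.append(LabeledPoint(label, vec))
training = sc.parallelize(points)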

My second issue is that saving labels and features separately and combining
them later could maybe cause problems when running on more nodes (right now
I'm running on a single node), because I cannot be sure that the order of
the labels I saved will match the order of the feature vectors in the sparse
vector RDD... or can I?
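
For illustration, an order-independent pairing would presumably need an
explicit key on both sides, e.g. zipWithIndex plus a join, something like
this sketch (assuming both RDDs were derived from the same source RDD in the
same order, so that the indices line up):

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# Key labels and vectors by their position: (index, label) / (index, vector).
keyed_labels = labels.zipWithIndex().map(lambda x: (x[1], x[0]))
keyed_vectors = tfidf.zipWithIndex().map(lambda x: (x[1], x[0]))

# Join on the index, then build the LabeledPoints in a distributed way,
# without any per-document collect.
training = keyed_labels.join(keyed_vectors) \
                       .map(lambda kv: LabeledPoint(kv[1][0], kv[1][1]))

model = NaiveBayes.train(training)

But I don't know whether zipWithIndex really guarantees matching indices in
this situation, or whether the join overhead makes this a bad idea, which is
part of why I'm asking.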

Is there a post or "best practice" I can read to solve these two issues?
Thanks a lot!


