We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle that efficiently; we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an "array larger than MaxInt" exception. Could you paste the stack trace?

-Xiangrui
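A rough check of the arithmetic above, as a sketch only. It assumes a dense model with one 8-byte double per (class, feature) pair and takes 300,000 as the second dimension, as the reply does. The function name `model_footprint` and the constant `JVM_INT_MAX` are hypothetical names for illustration, not part of MLlib.

```python
# Back-of-envelope check of the model-size figure quoted in the reply.
# Assumption: dense storage, one 8-byte double per (class, feature) pair.

JVM_INT_MAX = 2**31 - 1  # Int.MaxValue, the upper bound on a JVM array length


def model_footprint(num_classes, num_features, bytes_per_value=8):
    """Return (coefficient count, size in bytes) for a dense model."""
    count = num_classes * num_features
    return count, count * bytes_per_value


count, size_bytes = model_footprint(500, 300_000)

print(f"coefficients: {count:,}")
print(f"dense size: {size_bytes / 1e9:.1f} GB")
print(f"fits in one JVM array: {count <= JVM_INT_MAX}")
```

Under these assumptions the coefficient count stays well below Int.MaxValue, which is consistent with the point that model size alone should not trigger an "array larger than MaxInt" exception, although roughly a gigabyte of doubles per copy can still strain driver and executor memory.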
On Mon, Jun 22, 2015 at 4:21 PM, Danny <kont...@dannylinden.de> wrote:
> Hi,
>
> I am unfortunately not very experienced with MLlib, so I would
> appreciate a little help:
>
> Which multi-class classification algorithm should I use if I want to
> classify texts (100-1000 words each) into categories? The number of
> categories is between 100 and 500, and the number of training documents,
> which I have transformed to tf-idf vectors, is at most ~300,000.
>
> It looks like most algorithms run into OOM exceptions or "array
> larger than MaxInt" exceptions with a large number of classes/categories
> because there are "collect" steps in them?
>
> Thanks a lot
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/which-mllib-algorithm-for-large-multi-class-classification-tp23439.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org