Re: which mllib algorithm for large multi-class classification?

Danny Linden Wed, 24 Jun 2015 16:28:00 -0700

Hi,

Here is the stack trace; thanks for any help:


15/06/24 23:15:26 INFO DAGScheduler: Submitting ShuffleMapStage 19 (MapPartitionsRDD[49] at treeAggregate at LBFGS.scala:218), which has no missing parents
[error] (dag-scheduler-event-loop) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:869)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:818)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:817)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:817)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
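
For scale, here is a rough back-of-the-envelope check of the 500 * 300,000 model size mentioned in the quoted reply below. It is only a sketch: it assumes the weights are stored as dense 8-byte doubles, and it assumes a byte buffer that starts at a power-of-two capacity and grows by doubling. It does not show what was actually being serialized when the error above occurred. All names in it are made up for illustration.

```python
# Rough size estimate for the model discussed in this thread.
NUM_CLASSES = 500                # upper bound on categories, from the thread
WEIGHTS_PER_CLASS = 300_000      # second factor in the estimate quoted below
BYTES_PER_WEIGHT = 8             # assumption: dense 64-bit doubles
JAVA_MAX_ARRAY_LEN = 2**31 - 1   # Integer.MAX_VALUE, upper bound on a Java array length


def model_size(num_classes, weights_per_class, bytes_per_weight=BYTES_PER_WEIGHT):
    """Return (number of weights, size in bytes) for a dense weight matrix."""
    weights = num_classes * weights_per_class
    return weights, weights * bytes_per_weight


def doubled_capacity(needed_bytes):
    """Smallest power-of-two capacity that holds needed_bytes.

    Assumption: the buffer starts at a power-of-two size and doubles.
    """
    return 1 << (needed_bytes - 1).bit_length()


weights, size_bytes = model_size(NUM_CLASSES, WEIGHTS_PER_CLASS)
capacity = doubled_capacity(size_bytes)

print(f"weights:           {weights:,}")
print(f"dense size:        {size_bytes:,} bytes")
print(f"doubled capacity:  {capacity:,} bytes")
print(f"fits in one array: {capacity <= JAVA_MAX_ARRAY_LEN}")
```

Under these assumptions the raw weights (about 1.2 GB) are below the array length limit, but a doubling buffer large enough to hold them would have to grow past it.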

> On 24.06.2015 at 03:27, Xiangrui Meng <[email protected]> wrote:
> 
> We have multinomial logistic regression implemented. For your case,
> the model size is 500 * 300,000 = 150,000,000. MLlib's implementation
> might not be able to handle it efficiently; we plan to have a more
> scalable implementation in 1.5. However, it shouldn't give you an
> "array larger than MaxInt" exception. Could you paste the stack trace?
> -Xiangrui
> 
> On Mon, Jun 22, 2015 at 4:21 PM, Danny <[email protected]> wrote:
>> Hi,
>> 
>> I am unfortunately not very familiar with MLlib, so I would
>> appreciate a little help:
>> 
>> Which multi-class classification algorithm should I use if I want to
>> classify texts (100-1000 words each) into categories? The number of
>> categories is between 100 and 500, and the number of training documents,
>> which I have transformed to tf-idf vectors, is at most ~300,000.
>> 
>> It looks like most algorithms run into OOM exceptions or "array
>> larger than MaxInt" exceptions with a large number of classes/categories.
>> Is that because they contain "collect" steps?
>> 
>> Thanks a lot
>> 
>> 
>> 
