Hi, here is the stack trace. Thanks for any help:
15/06/24 23:15:26 INFO DAGScheduler: Submitting ShuffleMapStage 19 (MapPartitionsRDD[49] at treeAggregate at LBFGS.scala:218), which has no missing parents
[error] (dag-scheduler-event-loop) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:869)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:818)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:817)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:817)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
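
For what it's worth, here is a rough sanity check of the model size estimate in the reply quoted below (500 * 300,000). It is only a sketch: it assumes a dense model stored as 8-byte doubles, and all names in it are made up for illustration. It says nothing about how MLlib actually lays out or serializes the model.

```python
# Back-of-the-envelope size check. Assumption: a dense model of 8-byte doubles.
NUM_CLASSES = 500                   # upper end of the category count in this thread
NUM_FEATURES = 300_000              # second factor of the quoted 500 * 300,000 estimate
BYTES_PER_DOUBLE = 8
JAVA_MAX_ARRAY_LENGTH = 2**31 - 1   # Integer.MAX_VALUE, upper bound on a Java array length


def dense_model_bytes(num_classes, num_features, bytes_per_value=BYTES_PER_DOUBLE):
    """Raw size in bytes of num_classes * num_features dense values."""
    return num_classes * num_features * bytes_per_value


model_values = NUM_CLASSES * NUM_FEATURES
model_bytes = dense_model_bytes(NUM_CLASSES, NUM_FEATURES)

print(model_values)                          # 150000000
print(model_bytes)                           # 1200000000
print(model_bytes < JAVA_MAX_ARRAY_LENGTH)   # True
```

Under these assumptions the raw values alone come to 1.2e9 bytes, a bit more than half the maximum length of a Java byte array, before any serialization overhead.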
> On 24.06.2015 at 03:27, Xiangrui Meng <[email protected]> wrote:
>
> We have multinomial logistic regression implemented. For your case,
> the model size is 500 * 300,000 = 150,000,000. MLlib's implementation
> might not be able to handle it efficiently; we plan to have a more
> scalable implementation in 1.5. However, it shouldn't give you an
> "array larger than MaxInt" exception. Could you paste the stack trace?
> -Xiangrui
>
> On Mon, Jun 22, 2015 at 4:21 PM, Danny <[email protected]> wrote:
>> Hi,
>>
>> Unfortunately, I am not very experienced with MLlib, so I would
>> appreciate a little help:
>>
>> Which multi-class classification algorithm should I use if I want to
>> classify texts (100-1000 words each) into categories? The number of
>> categories is between 100 and 500, and the number of training documents,
>> which I have transformed to tf-idf vectors, is at most ~300,000.
>>
>> It looks like most algorithms run into OOM exceptions or "array
>> larger than MaxInt" exceptions with a large number of classes/categories,
>> because there are "collect" steps in them?
>>
>> Thanks a lot
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/which-mllib-algorithm-for-large-multi-class-classification-tp23439.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
