First thought: 70K features is *a lot* for the MLlib implementation (and
any PLANET-like implementation)

Using fewer partitions is a good idea.

Which Spark version was this on?
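One more note: the recursive ObjectInputStream frames in the trace suggest the serialized task closure contains a very long linked structure, which is typical of a long RDD lineage from many training iterations. Since your Strategy already sets useNodeIdCache and a checkpointInterval, make sure a checkpoint directory is set on the SparkContext; without one, the interval can't truncate the lineage. A minimal sketch of how the submission could look (the jar name, checkpoint path, and values are placeholders, not a tested recommendation):

```shell
# Sketch only: illustrative values, adjust to your cluster.
# Fewer partitions reduce per-task serialization overhead on wide data;
# -XX:ThreadStackSize is in KB, so 81920 is roughly an 80 MB stack.
spark-submit \
  --conf spark.default.parallelism=40 \
  --conf spark.executor.extraJavaOptions="-XX:ThreadStackSize=81920" \
  my-training-app.jar

# In the driver, before calling trainRegressor, set a checkpoint
# directory so useNodeIdCache/checkpointInterval can take effect:
#   sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints");
```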

On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:

> The questions I have in mind:
>
> Is this something one might expect? From the stack trace itself it's not
> clear where it comes from.
> Is it a known bug? I haven't found anything like it so far.
> Is it possible to configure something to work around / avoid this?
>
> I'm not sure it's the right thing to do, but I've
>     increased the thread stack size 10x (to 80 MB)
>     reduced default parallelism 10x (only 20 cores are available)
>
> Thank you in advance.
>
> --
> Be well!
> Jean Morozov
>
> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a web service that provides a REST API to train a random forest.
>> I train the random forest on a 5-node Spark cluster with enough memory:
>> everything is cached (~22 GB).
>> On small datasets of up to 100k samples everything is fine, but with the
>> biggest one (400k samples and ~70k features) I'm stuck with a
>> StackOverflowError.
>>
>> Additional options for my web service:
>>     spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>     spark.default.parallelism = 200.
>>
>> On the 400k-sample dataset:
>> - with the default thread stack size, it took 4 hours of training to hit
>> the error;
>> - with the increased stack size, it took 60 hours to hit it.
>> I can increase it further, but it's hard to say how much stack it needs,
>> and since the setting applies to all threads it might waste a lot of memory.
>>
>> Looking at the event timeline for different stages, I see that task
>> deserialization time gradually increases. By the end, task
>> deserialization time is roughly the same as executor computing time.
>>
>> Code I use to train the model:
>>
>> int MAX_BINS = 16;
>> int NUM_CLASSES = 0;            // ignored for regression
>> double MIN_INFO_GAIN = 0.0;
>> int MAX_MEMORY_IN_MB = 256;
>> double SUBSAMPLING_RATE = 1.0;
>> boolean USE_NODEID_CACHE = true;
>> int CHECKPOINT_INTERVAL = 10;
>> int RANDOM_SEED = 12345;
>>
>> int NODE_SIZE = 5;              // minimum instances per node
>> int maxDepth = 30;
>> int numTrees = 50;
>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
>>         maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
>>         new scala.collection.immutable.HashMap<>(), NODE_SIZE, MIN_INFO_GAIN,
>>         MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE,
>>         CHECKPOINT_INTERVAL);
>> RandomForestModel model = RandomForest.trainRegressor(labeledPoints.rdd(),
>>         strategy, numTrees, "auto", RANDOM_SEED);
>>
>>
>> Any advice would be highly appreciated.
>>
>> The exception (~3000 lines long):
>>  java.lang.StackOverflowError
>>         at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>         at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>         at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>         at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>
>> --
>> Be well!
>> Jean Morozov
>>
>
>
