Can you try reducing maxBins? That reduces communication (at the cost of
coarser discretization of continuous features).
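For reference, a minimal sketch of where maxBins sits in the RDD-based MLlib
API that 1.6 ships; the dataset, tree count, and the value 16 below are
placeholders, not settings taken from this thread:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical training call: lowering maxBins below the default of 32
// shrinks the per-node split statistics the workers have to communicate,
// at the cost of coarser candidate splits for continuous features.
def trainWithFewerBins(training: RDD[LabeledPoint]): RandomForestModel =
  RandomForest.trainClassifier(
    training,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 10,
    maxBins = 16,  // placeholder value, smaller than the default 32
    seed = 42)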
On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote:
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small.
Communication scales linearly in the number of features.
On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote:
Joseph,
Correction, there are 20k features. Is it still a lot?
What number of features can be considered normal?
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
One more thing.
With an increased stack size it has completed twice more already, but now I
see this in the log:
[dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage
24860 contains a task of very large size (157 KB). The maximum recommended
task size is 100 KB.
Size of the task increas
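For context on "increased stack size": the thread doesn't say how it was
raised, but one common route is the JVM -Xss option passed through Spark's
extra-Java-options settings. A hedged sketch, with placeholder values:

import org.apache.spark.SparkConf

// Hypothetical sketch only. Executor JVM options can be set in SparkConf;
// driver options normally have to be passed at launch (e.g. on the
// spark-submit command line) to take effect before the driver JVM starts.
val conf = new SparkConf()
  .setAppName("random-forest-training")               // placeholder app name
  .set("spark.executor.extraJavaOptions", "-Xss16m")  // placeholder stack size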
Joseph,
I'm using 1.6.0.
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
First thought: 70K features is *a lot* for the MLlib implementation (and
any PLANET-like implementation).
Using fewer partitions is a good idea.
Which Spark version was this on?
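To make the "fewer partitions" suggestion concrete, a minimal sketch; the
partition count is a made-up placeholder, not a recommendation from this
thread:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: collapse the cached training RDD into fewer
// partitions before training, so each pass over the data launches fewer,
// larger tasks. coalesce with the default shuffle=false avoids a full
// shuffle when the partition count only goes down.
def withFewerPartitions(training: RDD[LabeledPoint]): RDD[LabeledPoint] =
  training.coalesce(numPartitions = 32).cache()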
On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote:
The questions I have in mind:
Is it something that one might expect? From the stack trace itself it's not
clear where it comes from.
Is it an already known bug? Although I haven't found anything like that.
Is it possible to configure something to work around or avoid this?
I'm not sure it's the
Hi,
I have a web service that provides a REST API to train the random forest
algorithm.
I train the random forest on a 5-node Spark cluster with enough memory -
everything is cached (~22 GB).
On small datasets of up to 100k samples everything is fine, but with the
biggest one (400k samples and ~70k features) I'm