Yes, that's a great option when the modeling process itself doesn't really
need Spark. You can use whatever modeling tool you like and still get
parallelism in the tuning step via hyperopt's Spark integration.
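
For illustration, here is a minimal sketch of that pattern, assuming
hyperopt, pyspark and scikit-learn are installed and a Spark session is
available. The dataset, the search space, and the function names
(objective, tune) are placeholders of mine, not anything from this thread.
Each trial fits an ordinary in-memory sklearn model; SparkTrials only
farms the trials out to executors.

```python
def objective(params):
    """Fit one scikit-learn model in a single Spark task and return its loss."""
    # Imports live inside the function so they resolve on the executor.
    from hyperopt import STATUS_OK
    from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = GradientBoostingClassifier(
        n_estimators=int(params["n_estimators"]),  # quniform yields floats
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
    )
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    # hyperopt minimizes the loss, so negate the accuracy.
    return {"loss": -accuracy, "status": STATUS_OK}


def tune(max_evals=16, parallelism=4):
    """Run the search with trials evaluated in parallel on Spark."""
    from hyperopt import SparkTrials, fmin, hp, tpe

    # Illustrative search space; the ranges are assumptions, not recommendations.
    space = {
        "n_estimators": hp.quniform("n_estimators", 50, 300, 50),
        "max_depth": hp.quniform("max_depth", 2, 6, 1),
        "learning_rate": hp.loguniform("learning_rate", -4, 0),
    }
    trials = SparkTrials(parallelism=parallelism)
    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
```

Calling tune() returns the best parameter values found. The model itself
never touches Spark; only the tuning loop is distributed.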
On Thu, Apr 1, 2021 at 10:50 AM Williams, David (Risk Value Stream)
wrote:
> Classification: Public
Cc: user@spark.apache.org
Subject: Re: FW: Email to Spark Org please
-- This email has reached the Bank via an external source --
Right, could also be the case that the overhead of distributing it is just
dominating.
You wouldn't use sklearn with Spark, just use sklearn at this sc
uster. So if we get that working in distributed, will we get
> benefits similar to spark ML?
>
>
>
> Best Regards,
>
> Dave Williams
>
>
>
> *From:* Sean Owen
> *Sent:* 26 March 2021 13:20
> *To:* Williams, David (Risk Value Stream)
>
> *Cc:* user@spar
ent:* 25 March 2021 16:40
> *To:* Williams, David (Risk Value Stream) <
> david.willi...@lloydsbanking.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: FW: Email to Spark Org please
Spark is overkill for this problem; use sklearn.
But I'd suspect that you are using just 1 partition for such a small data
set, and so get no parallelism from Spark.
You could repartition your input into many more partitions, but it's
unlikely to get much faster than in-core sklearn for this task.
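
A rough sketch of the repartitioning step, assuming a PySpark DataFrame.
The helper names and the default of 3 tasks per core are my own
assumptions (the Spark tuning guide suggests 2-3 tasks per CPU core), so
treat the numbers as a starting point rather than a rule.

```python
def choose_num_partitions(total_cores, tasks_per_core=3):
    """Pick a partition count that gives every core a few tasks."""
    return max(1, total_cores * tasks_per_core)


def spread_for_training(df, total_cores):
    """Repartition df only if it currently has too few partitions.

    df is expected to be a pyspark.sql.DataFrame; the function relies only
    on df.rdd.getNumPartitions() and df.repartition(n).
    """
    current = df.rdd.getNumPartitions()
    target = choose_num_partitions(total_cores)
    # repartition() triggers a full shuffle, so skip it when not needed.
    return df.repartition(target) if current < target else df
```

For example, train_df = spread_for_training(train_df, total_cores=8)
before fitting the Spark ML classifier. Checking
train_df.rdd.getNumPartitions() first will show whether the data really
is sitting in a single partition.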
On Thu, Mar 2
Classification: Public
Hi Team,
We are trying to use the ML Gradient Boosting Tree Classification algorithm and
found that its performance during training is very poor.
We would like to see whether we can improve the training time, since it is taking
2 days for training for a smaller