It seems that the GitHub branch-1.5 has already changed the version to
1.5.1-SNAPSHOT. I am a bit confused: are we still on 1.5.0 RC3, or are we
on 1.5.1?
Chester
On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin wrote:
> I'm going to -1 the release myself since the issue @yhuai identified is
> pretty serious
Pandas performance is definitely the issue here. You're using Pandas as an
ETL system, but it's better suited as an endpoint than as a conduit.
That is, it's great to dump your data there and do your analysis within
Pandas, subject to its constraints, but if you need to "back out" and use
some
I'm going to -1 the release myself since the issue @yhuai identified is
pretty serious. It basically OOMs the driver when reading any files with a
large number of partitions. Looks like the patch for that has already been
merged.
I'm going to cut rc3 momentarily.
On Sun, Aug 30, 2015 at 11:30 AM,
Tested now against Spark 1.5.0 RC2, and the same exceptions happen when
num-executors > 2:
15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 5.0
(TID 501, xxx): java.lang.ClassCastException: java.lang.Double cannot
be cast to java.lang.Long
at scala.runtime.BoxesRunT
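(For context: this is the JVM's boxed-numeric cast failing inside Scala's
unboxing helper, not a Spark-specific code path. A minimal, illustrative
Scala reproduction of the same failure, not the reporter's code:)

    // Unboxing an Any that actually holds a java.lang.Double into a Long
    // goes through scala.runtime.BoxesRunTime, which throws the same
    // ClassCastException seen in the trace above.
    object CastRepro {
      def main(args: Array[String]): Unit = {
        val boxed: Any = 2.0             // runtime class is java.lang.Double
        val n = boxed.asInstanceOf[Long] // ClassCastException: Double cannot be cast to Long
        println(n)
      }
    }

So a column whose values arrive as boxed Doubles but whose schema says Long
will blow up at exactly this point.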
Thanks Josh ... I'll take a look
On 31 Aug 2015 19:21, "Josh Rosen" wrote:
> There are currently a few known issues with using KryoSerializer as the
> closure serializer, so it's going to require some changes to Spark if we
> want to properly support this. See
> https://github.com/apache/spark/pu
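(For reference, the Spark 1.x knob being discussed looks like the snippet
below. This is illustrative only; as the known issues above indicate, Kryo
is not properly supported as the closure serializer yet, so flipping this
is experimental:)

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: the 1.x closure-serializer setting under discussion.
    val conf = new SparkConf()
      .setAppName("kryo-closure-experiment")
      .set("spark.closure.serializer",
           "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)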
Sounds good, want me to create a JIRA and link it to SPARK-9697? Will put
down some ideas to start.
On Aug 31, 2015 4:14 AM, "Reynold Xin" wrote:
> BTW if you are interested in this, we could definitely get some help in
> terms of prototyping the feasibility, i.e. how we can have a native (e.g.
>
There are two issues here:
1. Suppression of the true reason for failure. The Spark runtime reports
"TypeError", but that is not why the operation failed.
2. The low performance of loading a pandas DataFrame.
DISCUSSION
Number (1) is easily fixed, and fixing it is the primary purpose of my post.
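(A sketch of the general fix pattern for (1), written in Scala purely for
illustration; the actual fix would live in the PySpark conversion path. The
helper name is hypothetical. The point is to wrap the original exception as
the cause rather than replacing it with a generic error:)

    import org.apache.spark.SparkException

    // Hypothetical helper: rethrow with the original throwable attached as
    // the cause, so the true reason for failure survives into the logs
    // instead of being masked by a generic TypeError.
    def convertOrExplain[T](what: String)(body: => T): T =
      try body
      catch {
        case e: Exception =>
          throw new SparkException(s"Failed while $what", e)
      }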
Number (2)
If you look at the recurrent issues in datacentre-scale computing systems, two
stand out:
- resilience to failure: that's algorithms and the layers underneath (storage,
  work allocation & tracking, ...)
- scheduling: maximising resource utilisation while prioritising high-SLA work
  (interactive thing
Hi devs,
Currently the only supported serializer for serializing tasks in
DAGScheduler.scala is JavaSerializer.
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage: ResultStage =>
        closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
    }
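(A rough sketch of what making this pluggable might look like. The config
key, object name, and reflective wiring below are assumptions to frame the
discussion, not existing Spark behaviour; the default preserves what the
code above does today:)

    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.Serializer

    // Hypothetical sketch: choose the task serializer from configuration
    // instead of hard-coding JavaSerializer in DAGScheduler.
    object TaskSerializerResolver {
      def resolve(conf: SparkConf): Serializer = {
        val clazz = conf.get(
          "spark.scheduler.taskSerializer",             // hypothetical key
          "org.apache.spark.serializer.JavaSerializer") // current behaviour
        Class.forName(clazz)
          .getConstructor(classOf[SparkConf])
          .newInstance(conf)
          .asInstanceOf[Serializer]
      }
    }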
BTW if you are interested in this, we could definitely get some help in
terms of prototyping the feasibility, i.e. how we can have a native (e.g.
C++) API for data access shipped with Spark. There are a lot of questions
(e.g. build, portability) that need to be answered.
On Mon, Aug 31, 2015 at 1:
On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss wrote:
>
> Also, is this work being done on a branch I could look into further and
> try out?
>
>
We don't have a branch yet -- because there is no code or design for this
yet. As I said, it is one of the motivations behind Tungsten, but it is
fairly e