Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Chester Chen
It seems that the GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT. I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1? Chester On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin wrote: > I'm going to -1 the release myself since the issue @yhuai identified is > pretty serious

Re: IOError on createDataFrame

2015-08-31 Thread Philip
Pandas performance is definitely the issue here. You're using Pandas as an ETL system, and it's more suitable as an endpoint than as a conduit. That is, it's great to dump your data there and do your analysis within Pandas, subject to its constraints, but if you need to "back out" and use some

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Reynold Xin
I'm going to -1 the release myself since the issue @yhuai identified is pretty serious. It basically OOMs the driver for reading any files with a large number of partitions. Looks like the patch for that has already been merged. I'm going to cut rc3 momentarily. On Sun, Aug 30, 2015 at 11:30 AM,

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-08-31 Thread Olivier Girardot
Tested now against Spark 1.5.0 rc2, and the same exceptions happen when num-executors > 2: 15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 5.0 (TID 501, xxx): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long at scala.runtime.BoxesRunT

Re: KryoSerializer for closureSerializer in DAGScheduler

2015-08-31 Thread yash datta
Thanks Josh ... I'll take a look On 31 Aug 2015 19:21, "Josh Rosen" wrote: > There are currently a few known issues with using KryoSerializer as the > closure serializer, so it's going to require some changes to Spark if we > want to properly support this. See > https://github.com/apache/spark/pu

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Paul Weiss
Sounds good, want me to create a jira and link it to SPARK-9697? Will put down some ideas to start. On Aug 31, 2015 4:14 AM, "Reynold Xin" wrote: > BTW if you are interested in this, we could definitely get some help in > terms of prototyping the feasibility, i.e. how we can have a native (e.g. >

Re: IOError on createDataFrame

2015-08-31 Thread fsacerdoti
There are two issues here: 1. Suppression of the true reason for failure. The spark runtime reports "TypeError" but that is not why the operation failed. 2. The low performance of loading a pandas dataframe. DISCUSSION Number (1) is easily fixed, and the primary purpose for my post. Number (2)

Re: Research of Spark scalability / performance issues

2015-08-31 Thread Steve Loughran
If you look at the recurrent issues in datacentre-scale computing systems, two stand out -resilience to failure: that's algorithms and the layers underneath (storage, work allocation & tracking ...) -scheduling: maximising resource utilisation while prioritising high-SLA work (interactive thing

KryoSerializer for closureSerializer in DAGScheduler

2015-08-31 Thread yash datta
Hi devs, Currently the only supported serializer for serializing tasks in DAGScheduler.scala is JavaSerializer. val taskBinaryBytes: Array[Byte] = stage match { case stage: ShuffleMapStage => closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array() case stage: ResultStag

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
BTW if you are interested in this, we could definitely get some help in terms of prototyping the feasibility, i.e. how we can have a native (e.g. C++) API for data access shipped with Spark. There are a lot of questions (e.g. build, portability) that need to be answered. On Mon, Aug 31, 2015 at 1:

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss wrote: > > Also, is this work being done on a branch I could look into further and > try out? > > We don't have a branch yet -- because there is no code nor design for this yet. As I said, it is one of the motivations behind Tungsten, but it is fairly e