Re: Research of Spark scalability / performance issues

2015-08-29 Thread Reynold Xin
Both 2 and 3 are pretty good topics for a master's project, I think. You can also look into how one can improve Spark's scheduler throughput. A couple of years ago Kay measured it, but things have changed. It would be great to start with measurement, and then look at where the bottlenecks are, and see how

Re: Tungsten off heap memory access for C++ libraries

2015-08-29 Thread Reynold Xin
Supporting non-JVM code without memory copying and serialization is actually one of the motivations behind Tungsten. We didn't talk much about it since it is not end-user-facing and it is still too early. There are a few challenges still: 1. Spark cannot run entirely in off-heap mode (by entirely

Re: Tungsten off heap memory access for C++ libraries

2015-08-29 Thread Timothy Chen
I would also like to see data shared off-heap with a 3rd-party C++ library via JNI. I think the complications would be how to manage the memory and make sure the 3rd-party libraries also adhere to the access contracts. Tim On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss wrote: > Hi, > > Wou

Tungsten off heap memory access for C++ libraries

2015-08-29 Thread Paul Weiss
Hi, Would the benefits of Project Tungsten be available to non-JVM programs accessing the off-heap memory directly? Spark using DataFrames w/ the Tungsten improvements will definitely help analytics within the JVM world, but accessing outside 3rd-party C++ libraries is a challenge especially

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-29 Thread vaquar khan
+1 (1.5.0 RC2) Compiled on Windows with YARN. Regards, Vaquar khan +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min mvn clean package -Pyarn -Phadoop-2.6 -DskipTests 2. Tested pyspark, mllib 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ri