Both 2 and 3 are pretty good topics for a master's project, I think.
You can also look into how one can improve Spark's scheduler throughput.
A couple of years ago Kay measured it, but things have changed. It would be great
to start with measurement, and then look at where the bottlenecks are, and
see how
Supporting non-JVM code without memory copying and serialization is
actually one of the motivations behind Tungsten. We didn't talk much about
it since it is not end-user-facing and it is still too early. There are a
few challenges still:
1. Spark cannot run entirely in off-heap mode (by entirely
I would also like to see data shared off-heap with a 3rd party C++
library through JNI. I think the complications would be how to manage
that memory and how to make sure the 3rd party libraries adhere to the
access contracts.
Tim
On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss wrote:
> Hi,
>
> Wou
Hi,
Would the benefits of Project Tungsten be available to non-JVM programs
accessing the off-heap memory directly? Spark using DataFrames with the
Tungsten improvements will definitely help analytics within the JVM world,
but accessing outside 3rd party C++ libraries is a challenge especially
+1 (1.5.0 RC2) Compiled on Windows with YARN.
Regards,
Vaquar khan
+1 (non-binding, of course)
1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ri