Re: Make off-heap store pluggable

2015-07-20 Thread Sean Owen
(Related, not important comment: it would also be nice to separate out the Tachyon dependency from core, as it's conceptually pluggable but is still hard-coded in several places in the code and in a lot of the comments/docs.) On Tue, Jul 21, 2015 at 5:40 AM, Reynold Xin wrote: > I se…

Re: Make off-heap store pluggable

2015-07-20 Thread Matei Zaharia
I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to using …
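For illustration, a minimal sketch of the first route: once a system exposes the Hadoop FileSystem API, Spark can read from it with no changes to core. The igfs:// scheme and the IgniteHadoopFileSystem class name below are Ignite's Hadoop-accelerator identifiers as best recalled; treat them as assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ignite-fs-sketch"))
    // Map the igfs:// scheme to Ignite's Hadoop FileSystem implementation
    // (class name is an assumption, shown for illustration only).
    sc.hadoopConfiguration.set("fs.igfs.impl",
      "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem")
    // From here, Ignite looks like any other Hadoop-compatible filesystem.
    val lines = sc.textFile("igfs://igfs@localhost/some/path")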

Re: Make off-heap store pluggable

2015-07-20 Thread Reynold Xin
I sent it prematurely. They are already pluggable, or at least in the process of becoming more pluggable. In 1.4, instead of calling the external system's API directly, we added an API for that. There is a patch to add support for the HDFS in-memory cache. Somewhat orthogonal to this, longer term, I am no…
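A hedged sketch of what that pluggability looks like from the user side: as I recall the 1.4-era external block store API, the backend is selected by configuration rather than hard-coded. The property names below are assumptions from memory, not verified against the 1.4 source.

    import org.apache.spark.SparkConf

    // Select the off-heap (external block store) implementation by config;
    // plugging in another backend would mean pointing this at a different
    // ExternalBlockManager implementation. Names are assumptions.
    val conf = new SparkConf()
      .set("spark.externalBlockStore.blockManager",
           "org.apache.spark.storage.TachyonBlockManager")
      .set("spark.externalBlockStore.url", "tachyon://localhost:19998")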

Re: Make off-heap store pluggable

2015-07-20 Thread Reynold Xin
They are already pluggable. On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma wrote: > +1 Looks like a nice idea (I do not see any harm). Would you like to work > on the patch to support it? > > Prashant Sharma > > On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk < > alexey.goncha...@gmail.…

Re: Make off-heap store pluggable

2015-07-20 Thread Prashant Sharma
+1 Looks like a nice idea (I do not see any harm). Would you like to work on the patch to support it? Prashant Sharma On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk < alexey.goncha...@gmail.com> wrote: > Hello Spark community, > > I was looking through the code in order to understand better…

Make off-heap store pluggable

2015-07-20 Thread Alexey Goncharuk
Hello Spark community, I was looking through the code in order to better understand how an RDD is persisted to the Tachyon off-heap filesystem. It looks like the Tachyon filesystem is hard-coded and there is no way to switch to another in-memory filesystem. I think it would be great if the implement…

Re: Should spark-ec2 get its own repo?

2015-07-20 Thread Mridul Muralidharan
Might be a good idea to get the PMCs of both projects to sign off to prevent future issues with Apache. Regards, Mridul On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman wrote: > I've created https://github.com/amplab/spark-ec2 and added an initial set of > committers. Note that this is not…

Re: Should spark-ec2 get its own repo?

2015-07-20 Thread Shivaram Venkataraman
Technically the project ends in 2017, and I think we will figure out a transition for AMPLab repositories when it does. It should be pretty simple to transfer ownership to a new organization if/when the time comes. Thanks Shivaram On Mon, Jul 20, 2015 at 12:03 PM, …

Re: Worker memory leaks?

2015-07-20 Thread Richard Marscher
Hi, thanks for the follow-up. You are right regarding the invalidation of observation #2. I later realized the Worker UI page directly displays the entries in the executors map, and I can see in our production UI that it's in a proper state. As for Killed vs. Exited, it's less relevant now since the t…

Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Hi, I’m looking at the online docs for building Spark 1.4.1: http://spark.apache.org/docs/latest/building-spark.html I was interested in building Spark for Scala 2.11 (the latest Scala) and also with Hive and JDBC support. The docs say…
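For reference, a sketch of the relevant commands from the 1.4.x "Building Spark" docs as best recalled (script and property names are from memory; the docs also note, if I recall correctly, that the JDBC/thrift-server component was not yet supported under Scala 2.11, so -Phive-thriftserver is omitted here):

    # Switch the build's source and POM versions to Scala 2.11,
    # then build with Hive support.
    dev/change-version-to-2.11.sh
    mvn -Dscala-2.11 -Phive -DskipTests clean package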

Re: Worker memory leaks?

2015-07-20 Thread Josh Rosen
Hi Richard, Thanks for your detailed investigation of this issue. I agree with your observation that the finishedExecutors hashmap is a source of memory leaks for very-long-lived clusters. It looks like the finishedExecutors map is only read when rendering the Worker Web UI and in constructing R…
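One plausible direction, sketched below (this is not Spark's actual fix, and all names are hypothetical): cap how many finished executors are retained, evicting the oldest on insert so the map stays bounded on long-lived workers.

    import scala.collection.mutable

    // Hypothetical bounded log of finished executors: LinkedHashMap keeps
    // insertion order, so the head is always the oldest entry.
    class BoundedFinishedExecutors[K, V](retained: Int) {
      private val entries = mutable.LinkedHashMap.empty[K, V]
      def add(id: K, info: V): Unit = {
        entries += (id -> info)
        // Evict oldest entries once the cap is exceeded.
        while (entries.size > retained) entries -= entries.head._1
      }
      def snapshot: Seq[(K, V)] = entries.toSeq // e.g. for rendering the UI
    }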

Re: Should spark-ec2 get its own repo?

2015-07-20 Thread Reynold Xin
Is amplab the right owner, given it's ending next year? Maybe we should create spark-ec2 or spark-project instead? On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I've created https://github.com/amplab/spark-ec2 and added an initial set > of committers…

Re: Should spark-ec2 get its own repo?

2015-07-20 Thread Shivaram Venkataraman
I've created https://github.com/amplab/spark-ec2 and added an initial set of committers. Note that this is not a fork of the existing github.com/mesos/spark-ec2 and users will need to fork from here. This is mostly to avoid the base-fork in pull requests being set incorrectly, etc. I'll be migrating…

Worker memory leaks?

2015-07-20 Thread Richard Marscher
Hi, we have been experiencing issues in production over the past couple of weeks with Spark Standalone Worker JVMs that seem to have memory leaks. Old Gen usage accumulates until it hits the max, and the workers then reach a failed state that starts critically failing some applications running against the cluster. I'…

Re: If gmail, check spam

2015-07-20 Thread Richard Marscher
I've set up filters in my Gmail to avoid this. If you have a filter matching the mailing list(s), you can set a flag in Gmail to never send matching mail to spam. On Sat, Jul 18, 2015 at 7:28 PM, Ted Yu wrote: > Interesting read. > > I did find a lot of Spark mails in the Spam folder. > > Thanks Mridul

Re: KinesisStreamSuite failing in master branch

2015-07-20 Thread Ted Yu
TD: Thanks for getting the builds back to green. On Sun, Jul 19, 2015 at 7:21 PM, Tathagata Das wrote: > The PR to fix this is out. > https://github.com/apache/spark/pull/7519 > > On Sun, Jul 19, 2015 at 6:41 PM, Tathagata Das > wrote: > >> I am taking care of this right now. >> >> On Sun, Ju…

Re: countByValue on dataframe with multiple columns

2015-07-20 Thread Jonathan Winandy
Ahoy! Maybe you can get countByValue by using sql.GroupedData:

    // some DF
    val df: DataFrame = sqlContext.createDataFrame(
      sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
      StructType(List(StructField("n", StringType))))
    df.groupBy("n").count().show()
    // generic
    def countByValueDf(df: …
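The generic helper is cut off above; a hedged guess at how it might continue (the signature and body below are assumptions, built on the same groupBy/count idea):

    import org.apache.spark.sql.DataFrame

    // Count occurrences of each distinct combination of the given columns.
    def countByValueDf(df: DataFrame, cols: String*): DataFrame =
      df.groupBy(cols.head, cols.tail: _*).count()

    countByValueDf(df, "n").show()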

Re: KryoSerializer gives class cast exception

2015-07-20 Thread Eugene Morozov
Josh, thanks for the reply. So, it looks like, despite the progress, there is no way other than to fork and fix chill itself. It indeed doesn’t compile with kryo 2.24.0, but it wasn’t that hard to fix (it looks like I’ve just guessed the right code), although there are test failures now. On 17 Ju…
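For anyone trying the same experiment, a hedged sbt sketch for forcing the newer Kryo onto the classpath ahead of chill's transitive one (the 2.x coordinates are as best recalled; whether chill's bytecode actually runs against 2.24.0 is exactly what this thread is questioning):

    // build.sbt: override the transitive Kryo pulled in via chill.
    // Runtime compatibility with 2.24.0 is unverified.
    dependencyOverrides += "com.esotericsoftware.kryo" % "kryo" % "2.24.0"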

countByValue on dataframe with multiple columns

2015-07-20 Thread Olivier Girardot
Hi, Is there any plan to add the countByValue function to the Spark SQL DataFrame API? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD API right now, but for ML purposes, being able to get the most frequent categor…

Re: Compact RDD representation

2015-07-20 Thread Juan Rodríguez Hortalá
Hi, I'm not an authority in the Spark community, but what I would do is add the project to Spark Packages: http://spark-packages.org/. In fact, I think this case is similar to IndexedRDD, which is also in Spark Packages: http://spark-packages.org/package/amplab/spark-indexedrdd 2015-07-19 21:49 G…

Re: Dynamic resource allocation in Standalone mode

2015-07-20 Thread Andrew Or
Hi Ray, In standalone mode, you have the SparkDeploySchedulerBackend, which holds an AppClient. This is the component on the driver side that already talks to the Master to register the application. As for dynamic allocation in standalone mode, I literally *just* cr…
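For context, a minimal sketch of the application-side settings involved (property names as in the 1.4-era docs; dynamic allocation also requires the external shuffle service to be running on each worker):

    import org.apache.spark.SparkConf

    // Let Spark grow and shrink the executor count with load; the external
    // shuffle service must be enabled so executors can be removed safely.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")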

Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-20 Thread Manoj Kumar
+1 Sounds like a great idea. On Sun, Jul 19, 2015 at 10:54 PM, Sandy Ryza wrote: > +1 > > On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan > wrote: > >> Thanks for detailing, definitely sounds better. >> +1 >> >> Regards >> Mridul >> >> On Saturday, July 18, 2015, Reynold Xin wrote: …

Re: Foundation policy on releases and Spark nightly builds

2015-07-20 Thread Reynold Xin
Thanks, Sean. On Mon, Jul 20, 2015 at 12:22 AM, Sean Owen wrote: > This is done, and yes, I believe that resolves the issue as far as all here > know. > > http://spark.apache.org/downloads.html > -> > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds

Re: Foundation policy on releases and Spark nightly builds

2015-07-20 Thread Sean Owen
This is done, and yes, I believe that resolves the issue as far as all here know. http://spark.apache.org/downloads.html -> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds On Sun, Jul 19, 2015 at 5:26 PM, Patrick Wendell wrote: > Hey Sean, …