Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
Yeah I'll send a note to the mesos dev list just to make sure they are informed. Shivaram On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen wrote: > I agree it's worth informing Mesos devs and checking that there are no > big objections. I presume Shivaram is plugged in enough to Mesos that > there w

What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-21 Thread Yu Ishikawa
Hi all, When we send a PR, it seems that two requests to run tests are sometimes sent to Jenkins. What is the difference between SparkPullRequestBuilder and SlowSparkPullRequestBuilder? Thanks, Yu

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237 Please help me get it assigned to me. Thanks. Ted Malaska On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska wrote: > Cool, I will make a JIRA after I check in to my hotel, and try to get a > patch early next week. > On

-Phive-thriftserver when compiling for use in pyspark and JDBC connections

2015-07-21 Thread Aaron
I compile/make a distribution, with either the 1.4 branch or master, using -Phive-thriftserver, and attempt a JDBC connection to a MySQL DB using the latest connector (5.1.36) jar. When I set up the pyspark shell with: bin/pyspark --jars mysql-connection...jar --driver-class-path mysql-connecto

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Cool, I will make a JIRA after I check in to my hotel, and try to get a patch early next week. On Jul 21, 2015 5:15 PM, "Olivier Girardot" wrote: > Yes, and freqItems does not give you an ordered count (right?) + the > threshold makes it difficult to calibrate it + we noticed some strange > behav

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yes, and freqItems does not give you an ordered count (right?) + the threshold makes it difficult to calibrate it + we noticed some strange behaviour when testing it on small datasets. 2015-07-21 20:30 GMT+02:00 Ted Malaska : > Look at the implementation for frequent items. It is a different f
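For contrast with a frequent-items result, an exact count ordered by frequency can be sketched in plain Python. This is a minimal illustration, not Spark's freqItems and not a statement about how it is implemented; `ordered_value_counts` is a hypothetical helper name, and the input is assumed to be an in-memory list of values.

```python
from collections import Counter

def ordered_value_counts(values, top_n=None):
    """Return exact (value, count) pairs, most frequent first.

    Hypothetical helper for illustration only; top_n=None returns all values.
    """
    return Counter(values).most_common(top_n)

sample = ["a", "b", "a", "c", "a", "b"]
counts = ordered_value_counts(sample)            # every value, ordered by count
top_one = ordered_value_counts(sample, top_n=1)  # only the most frequent value
```

Because every value is counted exactly, there is no support threshold to calibrate, at the cost of keeping one counter per distinct value.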

Re: Foundation policy on releases and Spark nightly builds

2015-07-21 Thread Sean Busbey
Looks good to me. Thanks for helping find common ground, everyone, and Sean for handling the implementation. On Mon, Jul 20, 2015 at 2:22 AM, Sean Owen wrote: > This is done, and yes I believe that resolves the issue as far as all here > know. > > http://spark.apache.org/downloads.html > -> > > ht

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
That sounds good. Thanks for clarifying! Regards, Mridul On Tue, Jul 21, 2015 at 11:09 AM, Shivaram Venkataraman wrote: > That's part of the confusion we are trying to fix here -- the repository used > to live in the mesos GitHub account but was never a part of the Apache Mesos > project. It wa

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Sean Owen
I agree it's worth informing Mesos devs and checking that there are no big objections. I presume Shivaram is plugged in enough to Mesos that there won't be any surprises there, and that the project would also agree with moving this Spark-specific bit out. They may also want to leave a pointer to th

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Look at the implementation for frequent items. It is different from a true count. On Jul 21, 2015 1:19 PM, "Reynold Xin" wrote: > Is this just frequent items? > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 > > > >

Re: Make off-heap store pluggable

2015-07-21 Thread Zhan Zhang
Hi Alexey, SPARK-6479 is for the plugin API, and SPARK-6112 is for the HDFS plugin. Thanks. Zhan Zhang On Jul 21, 2015, at 10:56 AM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
That's part of the confusion we are trying to fix here -- the repository used to live in the mesos GitHub account but was never a part of the Apache Mesos project. It was a remnant part of Spark from when Spark used to live at github.com/mesos/spark. Shivaram On Tue, Jul 21, 2015 at 11:03 AM, Mrid

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of) it is owned by the Mesos project and so its PMC? - Mridul On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman wrote: > There is technically no PMC for the spark-ec2 project (I guess we are kind > o

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 23:29 GMT-07:00 Matei Zaharia : > I agree with this -- basically, to build on Reynold's point, you should be > able to get almost the same performance by implementing either the Hadoop > FileSystem API or the Spark Data Source API over Ignite in the right way. > This would let people sa

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:32 GMT-07:00 Prashant Sharma : > +1 Looks like a nice idea (I do not see any harm). Would you like to work > on the patch to support it? > > Prashant Sharma > > Yes, I would like to contribute to it once we clarify the appropriate path. --Alexey > > > On Tue, Jul 21, 2015 at 2:46

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:40 GMT-07:00 Reynold Xin : > I sent it prematurely. > > They are already pluggable, or at least in the process of becoming more > pluggable. In 1.4, instead of calling the external system's API directly, > we added an API for that. There is a patch to add support for HDFS > in-memory cach

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Reynold Xin
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska wrote: > 100% I would love to do it. Who is a good person to review the design with? > All I need i

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA / P

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it. Who is a good person to review the design with? All I need is a quick chat about the design and approach and I'll create the JIRA and push a patch. Ted Malaska On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi Ted, > The T

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Hi Ted, The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it on multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve, maybe we could try to enrich its output. Anyway, even

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
Ha, OK! Then the generic part would have this signature: def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] +1 for more work (blog / API) on data quality checks. Cheers, Jonathan TopCMSParams and some other monoids from Algebird are really cool for that: https://github.com

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post. http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know. Ted Malaska On Tue, Jul 21, 2

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in the column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column, not considering each n-tuple of column values as the key (wh
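The distinction drawn in this message can be illustrated without Spark. The following is a minimal plain-Python sketch, not the DataFrame API: `count_cols_by_value` and `count_rows_by_value` are hypothetical helper names, and rows are assumed to be dicts keyed by column name.

```python
from collections import Counter

def count_cols_by_value(rows, columns):
    """Per-column counts: one Counter of values for each column name."""
    return {col: Counter(row[col] for row in rows) for col in columns}

def count_rows_by_value(rows, columns):
    """Tuple-keyed counts: each distinct combination of column values is one key."""
    return Counter(tuple(row[col] for col in columns) for row in rows)

# Made-up sample data, for illustration only.
rows = [
    {"city": "Paris", "lang": "fr"},
    {"city": "Paris", "lang": "en"},
    {"city": "Lyon", "lang": "fr"},
]
per_column = count_cols_by_value(rows, ["city", "lang"])  # the requested output
per_tuple = count_rows_by_value(rows, ["city", "lang"])   # the n-tuple behaviour
```

Here `per_column["city"]` counts "Paris" twice, while `per_tuple` treats ("Paris", "fr") and ("Paris", "en") as separate keys, which is the behaviour the message says it does not want.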