Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I'll try, thanks. On Fri, Apr 24, 2015 at 00:09, Reynold Xin wrote: > You can do it similar to the way countDistinct is done, can't you? > > > https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78 > > > > On Thu, Apr 23, 2015 at 1:59 PM, Olivier Girardot < > o.girar...@

Contributing Documentation Changes

2015-04-23 Thread madhu phatak
Hi, As I was reading the Contributing to Spark wiki, it was mentioned that we can contribute external links to Spark tutorials. I have written many of these on my blog. It would be great if someone could add them to the Spark website. Regards, Madhukara

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
*bump* On Thu, Apr 23, 2015 at 3:46 PM, Sourav Chandra < sourav.chan...@livestream.com> wrote: > Hi TD, > > Some observations: > > 1. If I submit the application using the spark-submit tool with *client as > deploy mode*, it works fine with a single master and worker (driver, master > and worker are run

First-class support for pip/virtualenv in pyspark

2015-04-23 Thread Justin Uang
Hi, I have been trying to figure out how to ship a python package that I have been working on, and this has brought up a couple of questions for me. Please note that I'm fairly new to python package management, so any feedback/corrections are welcome =) It looks like the --py-files support we have mer
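To illustrate the packaging step under discussion, here is a hedged, self-contained sketch of zipping a local Python package so it could be shipped with spark-submit --py-files. The package name `mypkg` and all paths are illustrative, and the sys.path insertion only emulates locally what Spark does on executors when it distributes --py-files archives:

```python
# Sketch: build a zip archive of a local package, as one would ship via
# spark-submit --py-files. `mypkg` and the paths are made up for this example.
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, "mypkg")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("ANSWER = 42\n")  # a stand-in for real package code

zip_path = os.path.join(workdir, "mypkg.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # The arcname keeps the package directory structure inside the zip.
    zf.write(os.path.join(pkg_dir, "__init__.py"), "mypkg/__init__.py")

# Spark adds --py-files archives to sys.path on executors; emulate that here:
sys.path.insert(0, zip_path)
import mypkg
print(mypkg.ANSWER)  # 42
```

The same zip could then be passed as `spark-submit --py-files mypkg.zip app.py`, making `import mypkg` work on the executors.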

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Sean Owen
Permission to change project roles is restricted to admins / PMC, naturally. I don't know if it's documented beyond this that Gavin helpfully pasted: https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme Another option is to make the list of people who you can assign to include

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen wrote: > Following my comment earlier that "I think we set Assignee for Fixed > JIRAs consistently", I found there are actually 880 counter examples. > Lots of them are old, and I'll try to fix as many that are recent (for > the 1.4.0 release credits) as

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:47 PM, Hari Shreedharan wrote: > You’d need to add them as a contributor in the JIRA admin page. Once you > do that, you should be able to assign the jira to that person > > > Is this documented, and does every PMC member (or committer) have access to do that ? > > > Thanks,

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Hari Shreedharan
You’d need to add them as a contributor in the JIRA admin page. Once you do that, you should be able to assign the jira to that person Thanks, Hari On Thu, Apr 23, 2015 at 5:33 PM, Shivaram Venkataraman wrote: > A related question that has affected me in the past: If we get a PR from a > ne

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Shivaram Venkataraman
A related question that has affected me in the past: If we get a PR from a new developer, I sometimes find that I am not able to assign an issue to them after merging the PR. Is there a process we need to follow to get new contributors into a particular group in JIRA? Or does it somehow happen automa

Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Sean Owen
Following my comment earlier that "I think we set Assignee for Fixed JIRAs consistently", I found there are actually 880 counter examples. Lots of them are old, and I'll try to fix as many that are recent (for the 1.4.0 release credits) as I can stand to click through. Let's set Assignee after res

RE: Should we let everyone set Assignee?

2015-04-23 Thread Sean Owen
The merge script automatically updates the linked JIRA after merging the PR (which is why it is important to put the JIRA in the title). It can't auto-assign the JIRA since usernames don't match up, but it is an easy reminder to set the Assignee. I do right after, and I think other committers do too. I'll sea

RE: Should we let everyone set Assignee?

2015-04-23 Thread Ulanov, Alexander
My thinking is that the current way of assigning a contributor after the patch is done (or almost done) is OK. Parallel efforts are also OK as long as they are discussed in the issue's thread. Ilya Ganelin made a good point that it is about moving the project forward. It also adds means of competition "w

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
You can do it similar to the way countDistinct is done, can't you? https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78 On Thu, Apr 23, 2015 at 1:59 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > I found another way setting a SPARK_HOME on a release
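For readers following this thread, here is a pure-Python sketch of the varargs-wrapper pattern that countDistinct uses in python/pyspark/sql/functions.py. This is not the real pyspark code: `_to_seq` here is a stand-in for the Py4J helper that converts a Python list into a Scala Seq, and the returned tuple merely shows the shape of the call:

```python
# Simplified stand-in for pyspark's varargs-wrapper pattern. In the real
# code, _to_seq hands the list to the JVM via Py4J; here it just returns it.
def _to_seq(cols):
    return list(cols)

def coalesce(*cols):
    # A Scala varargs method is exposed to Python by collecting the Python
    # *args into a single Seq-typed argument on the JVM side.
    return ("coalesce", _to_seq(cols))

print(coalesce("a", "b"))  # ('coalesce', ['a', 'b'])
```

The point of the pattern is that the Python side accepts any number of columns while the JVM side sees exactly one Seq argument.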

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Joseph Bradley
I saw the PR already, but only saw this just now. I think both persists are useful based on my experience, but it's very hard to say in general. On Thu, Apr 23, 2015 at 12:22 PM, jimfcarroll wrote: > > Okay. > > PR: https://github.com/apache/spark/pull/5669 > > Jira: https://issues.apache.org/j

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I found another way: setting SPARK_HOME to a released version and launching an ipython to load the contexts. I may need your insight, however. I found why it hasn't been done at the same time: this method (like some others) uses varargs in Scala, and for now the way functions are called only one p

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
You need to first build the Spark assembly jar with "sbt/sbt assembly/assembly". Then usually I go into python/run-tests and comment out the non-SQL tests: #run_core_tests run_sql_tests #run_mllib_tests #run_ml_tests #run_streaming_tests And then you can run "python/run-tests". On Thu, Ap

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
What is the way of testing/building the pyspark part of Spark? On Thu, Apr 23, 2015 at 22:06, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > yep :) I'll open the jira when I've got the time. > Thanks > > On Thu, Apr 23, 2015 at 19:31, Reynold Xin wrote: > >> Ah damn. We need t

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
yep :) I'll open the jira when I've got the time. Thanks. On Thu, Apr 23, 2015 at 19:31, Reynold Xin wrote: > Ah damn. We need to add it to the Python list. Would you like to give it a > shot? > > > On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot < > o.girar...@lateral-thoughts.com> wrote: >

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Okay. PR: https://github.com/apache/spark/pull/5669 Jira: https://issues.apache.org/jira/browse/SPARK-7100 Hope that helps. Let me know if you need anything else. Jim -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-per

Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Reynold Xin
In the ctor of InputSource (I'm also considering adding an explicit initialize call), the implementation of InputSource can execute arbitrary code. The state in it will also be serialized and passed onto the executors. Yes - technically you can hijack getSplits in Hadoop InputFormat to do the same

Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Mingyu Kim
Hi Reynold, You mentioned that the new API allows arbitrary code to be run on the driver side, but it's not very clear to me how this is different from what Hadoop API provides. In your example of using broadcast, did you mean broadcasting something in InputSource.getPartitions() and having InputP

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
Ah damn. We need to add it to the Python list. Would you like to give it a shot? On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Yep no problem, but I can't seem to find the coalesce fonction in > pyspark.sql.{*, functions, types or whatever :) } > >

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Those are different RDDs that DecisionTree persists, though. It's not redundant. On Thu, Apr 23, 2015 at 11:12 AM, jimfcarroll wrote: > Hi Sean and Joe, > > I have another question. > > GradientBoostedTrees.run iterates over the RDD calling DecisionTree.run on > each iteration with a new random s

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Sean and Joe, I have another question. GradientBoostedTrees.run iterates over the RDD, calling DecisionTree.run on each iteration with a new random sample from the input RDD. DecisionTree.run calls RandomForest.run, which also calls persist. One of these seems superfluous. Should I simply re
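For context, the leak being discussed is a violation of the usual cache-hygiene pattern sketched below. The `RDD` class here is a tiny pure-Python stand-in for Spark's RDD, not the real API; it only exists to show the persist/unpersist pairing:

```python
# Minimal stand-in for an RDD with a persisted flag; illustrative only.
class RDD:
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self

def train(input_rdd):
    cached = input_rdd.persist()     # cache once for repeated iterations
    try:
        result = sum(cached.data)    # stand-in for the iterative training loop
    finally:
        cached.unpersist()           # always release, or the cached RDD leaks
    return result

print(train(RDD([1, 2, 3])))  # 6
```

The bug under discussion is the case where a persist happens (here, inside a callee like RandomForest.run) with no matching unpersist on that same RDD.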

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Only against master; it can be cherry-picked to other branches. On Thu, Apr 23, 2015 at 10:53 AM, jimfcarroll wrote: > Hi Joe, > > Do you want a PR per branch (one for master, one for 1.3)? Are you still > maintaining 1.2? Do you need a Jira ticket per PR or can I submit them all > under the same

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Joe, Do you want a PR per branch (one for master, one for 1.3)? Are you still maintaining 1.2? Do you need a Jira ticket per PR or can I submit them all under the same ticket? Or should I just submit it to master and let you guys back-port it? Jim -- View this message in context: http://

Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution process in Spark, I've overhauled the guide to contributing. Anyone who is going to contribute needs to read it, as it has more formal guidance about the process: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+S

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
Yep, no problem, but I can't seem to find the coalesce function in pyspark.sql.{*, functions, types or whatever :) } Olivier. On Mon, Apr 20, 2015 at 11:48, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > a UDF might be a good idea, no ? > > On Mon, Apr 20, 2015 at 11:17, Olivier Gir

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
Hi TD, Some observations: 1. If I submit the application using the spark-submit tool with *client as deploy mode*, it works fine with a single master and worker (driver, master and worker are running on the same machine) 2. If I submit the application using the spark-submit tool with client as deploy mode, it *c
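For readers unfamiliar with the operation in the subject line, here is a pure-Python sketch of updateStateByKey semantics (not the real pyspark API): the update function receives each batch's new values plus the previous state per key. State is retained for every key ever seen, so unbounded key cardinality or large per-key state is a common source of OutOfMemory errors like the one reported here:

```python
# Stand-in for the update function passed to updateStateByKey: it gets the
# batch's new values and the previous state (None for an unseen key).
def update_func(new_values, current_state):
    return (current_state or 0) + sum(new_values)

# Emulate two streaming batches; `state` plays the role of the state RDD,
# which in a real job keeps an entry for every key ever observed.
state = {}
for batch in [{"a": [1, 2]}, {"a": [3], "b": [10]}]:
    for key, values in batch.items():
        state[key] = update_func(values, state.get(key))

print(state)  # {'a': 6, 'b': 10}
```

Because entries are never dropped unless the update function returns None for them, state size grows with the number of distinct keys.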

Re: Indices of SparseVector must be ordered while computing SVD

2015-04-23 Thread Sean Owen
I think we discussed this a while ago (?) and the problem was that the overhead of even verifying the sorted state was too high. On Thu, Apr 23, 2015 at 3:31 AM, Joseph Bradley wrote: > Hi Chunnan, > > There is currently Scala documentation for the constructor parameters: > https://github.com/apache/
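The verification Sean refers to amounts to a single linear pass over the index array; this pure-Python sketch (not MLlib code) shows what such a check would look like, and why even this O(nnz) cost per SparseVector construction could be considered too expensive:

```python
# One linear pass to confirm indices are strictly increasing, i.e. sorted
# with no duplicates -- the invariant SVD computation relies on.
def indices_sorted(indices):
    return all(indices[i] < indices[i + 1] for i in range(len(indices) - 1))

print(indices_sorted([0, 3, 7]))  # True
print(indices_sorted([0, 7, 3]))  # False
```

The trade-off discussed above is between paying this check on every construction and documenting the sortedness requirement as a caller contract.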

Re: In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

2015-04-23 Thread Reynold Xin
Your understanding is correct -- there is no partial aggregation currently for Hive UDAF. However, there is a PR to fix that: https://github.com/apache/spark/pull/5542 On Thu, Apr 23, 2015 at 1:30 AM, daniel.mescheder < daniel.mesche...@realimpactanalytics.com> wrote: > Hi everyone, > > I was

In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

2015-04-23 Thread daniel.mescheder
Hi everyone, I was playing with the integration of Hive UDAFs in Spark-SQL and noticed that the terminatePartial and merge methods of custom UDAFs were not called. This made me curious as those two methods are the ones responsible for distributing the UDAF execution in Hive. Looking at the code
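To make the question concrete, here is a pure-Python sketch (not the Hive API) of the two-phase aggregation that terminatePartial and merge enable, using an average as the example. Without partial aggregation, every row would be shipped to a single reducer instead of these small (sum, count) states:

```python
# Per-partition phase: compress a partition's rows into a small partial
# state, the role terminatePartial plays in a Hive UDAF.
def terminate_partial(values):
    return (sum(values), len(values))

# Merge phase: combine two partial states, the role of merge().
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

# Final phase: produce the result from the fully merged state.
def terminate(state):
    s, n = state
    return s / n

partitions = [[1, 2, 3], [4, 5]]
partials = [terminate_partial(p) for p in partitions]
merged = partials[0]
for p in partials[1:]:
    merged = merge(merged, p)

print(terminate(merged))  # 3.0
```

The observation in this thread is that Spark SQL was not invoking the partial/merge path for Hive UDAFs, losing exactly this distribution benefit.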