In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

2015-04-23 Thread daniel.mescheder
Hi everyone,

I was playing with the integration of Hive UDAFs in Spark-SQL and noticed that 
the terminatePartial and merge methods of custom UDAFs were not called. This 
made me curious as those two methods are the ones responsible for distributing 
the UDAF execution in Hive.
Looking at the code of HiveUdafFunction, which seems to be the wrapper for all
native Hive functions that have no Spark SQL-specific implementation, I noticed
that it

a) extends AggregateFunction and not PartialAggregate
b) only contains calls to iterate and evaluate, but never to merge of the 
underlying UDAFEvaluator object

My question is thus twofold: Is my observation correct, that to achieve 
distributed execution of a UDAF I have to add a custom implementation at the 
spark-sql layer (like the examples in aggregates.scala)? If that is the case, 
how difficult would it be to use the terminatePartial and merge functions 
provided by the UDAFEvaluator to make Hive UDAFs distributed by default?
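
For concreteness, here is a minimal sketch of the call pattern I have in mind.
It uses a simplified stand-in for an evaluator (not the actual Hive
UDAFEvaluator API): terminatePartial produces a per-partition partial state on
the map side, and merge combines those partials on the reduce side.

  // Simplified evaluator: only the call pattern matters here.
  class SumEvaluator {
    private var partial: Double = 0.0
    def iterate(value: Double): Unit = { partial += value }
    def terminatePartial(): Double = partial            // partial state shipped off a partition
    def merge(otherPartial: Double): Unit = { partial += otherPartial }
    def terminate(): Double = partial
  }

  object PartialAggregationSketch {
    def main(args: Array[String]): Unit = {
      val partitions = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0))

      // "Map side": one evaluator per partition produces a partial result.
      val partials = partitions.map { rows =>
        val ev = new SumEvaluator
        rows.foreach(ev.iterate)
        ev.terminatePartial()
      }

      // "Reduce side": a single evaluator merges the partial results.
      val finalEv = new SumEvaluator
      partials.foreach(finalEv.merge)
      println(finalEv.terminate())  // 15.0
    }
  }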


Cheers,

Daniel



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/In-Spark-SQL-is-there-support-for-distributed-execution-of-native-Hive-UDAFs-tp11753.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

2015-04-23 Thread Reynold Xin
Your understanding is correct -- there is currently no partial aggregation
for Hive UDAFs.

However, there is a PR to fix that:
https://github.com/apache/spark/pull/5542



On Thu, Apr 23, 2015 at 1:30 AM, daniel.mescheder <
daniel.mesche...@realimpactanalytics.com> wrote:

> Hi everyone,
>
> I was playing with the integration of Hive UDAFs in Spark-SQL and noticed
> that the terminatePartial and merge methods of custom UDAFs were not
> called. This made me curious as those two methods are the ones responsible
> for distributing the UDAF execution in Hive.
> Looking at the code of HiveUdafFunction which seems to be the wrapper for
> all native Hive functions for which there exists no spark-sql specific
> implementation, I noticed that it
>
> a) extends AggregateFunction and not PartialAggregate
> b) only contains calls to iterate and evaluate, but never to merge of the
> underlying UDAFEvaluator object
>
> My question is thus twofold: Is my observation correct, that to achieve
> distributed execution of a UDAF I have to add a custom implementation at
> the spark-sql layer (like the examples in aggregates.scala)? If that is the
> case, how difficult would it be to use the terminatePartial and merge
> functions provided by the UDAFEvaluator to make Hive UDAFs distributed by
> default?
>
>
> Cheers,
>
> Daniel
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/In-Spark-SQL-is-there-support-for-distributed-execution-of-native-Hive-UDAFs-tp11753.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.


Re: Indices of SparseVector must be ordered while computing SVD

2015-04-23 Thread Sean Owen
I think we discussed this a while ago (?) and the problem was that even the
overhead of verifying the sorted state took too long.
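
For reference, the kind of check being discussed is roughly the sketch below
(an assumption about where it could live, not the actual MLlib code). It is a
single pass over the non-zero indices, which is exactly the overhead in
question:

  // Sketch: verify that a sparse vector's indices are strictly increasing.
  def indicesAreOrdered(indices: Array[Int]): Boolean = {
    var i = 1
    while (i < indices.length) {
      if (indices(i) <= indices(i - 1)) return false
      i += 1
    }
    true
  }

  // e.g. a factory method like Vectors.sparse could then do:
  // require(indicesAreOrdered(indices), "indices must be strictly increasing")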

On Thu, Apr 23, 2015 at 3:31 AM, Joseph Bradley  wrote:
> Hi Chunnan,
>
> There is currently Scala documentation for the constructor parameters:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515
>
> There is one benefit to not checking for validity (ordering) within the
> constructor: If you need to translate between SparseVector and some other
> library's type (e.g., Breeze), you can do so with a few reference copies,
> rather than iterating through or copying the actual data.  It might be good
> to provide this check within Vectors.sparse(), but we'd need to check
> through MLlib for uses of Vectors.sparse which expect it to be a cheap
> operation.  What do you think?
>
> It is documented in the programming guide too:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md
> But perhaps that should be more prominent.
>
> If you think it would be helpful, then please do make a JIRA about adding a
> check to Vectors.sparse().
>
> Joseph
>
> On Wed, Apr 22, 2015 at 8:29 AM, Chunnan Yao  wrote:
>
>> Hi all,
>> I am using Spark 1.3.1 to write a spectral clustering algorithm. This really
>> confused me today. At first I thought my implementation was wrong; it turns
>> out it's an issue in MLlib. Fortunately, I've figured it out.
>>
>> I suggest adding a hint to the MLlib user documentation (as far as I know,
>> there is no such hint yet) that the indices of a local SparseVector must be
>> in ascending order. Because I was not aware of this, I spent a lot of time
>> looking for reasons why computeSVD of RowMatrix did not run correctly on
>> sparse data. I don't know how a SparseVector with unordered indices affects
>> other functions, but I believe it is necessary to let users know, or to fix
>> it. Actually, it's very easy to fix: just sort by index in the internal
>> construction of SparseVector.
>>
>> Here is an example to reproduce the effect of an unordered SparseVector on
>> computeSVD.
>> 
>> //in spark-shell, Spark 1.3.1
>>  import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>  import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector,
>> Vectors}
>>
>>   val sparseData_ordered = Seq(
>> Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>> Vectors.sparse(3, Array(0,1,2), Array(3.0, 4.0, 5.0)),
>> Vectors.sparse(3, Array(0,1,2), Array(6.0, 7.0, 8.0)),
>> Vectors.sparse(3, Array(0,2), Array(9.0, 1.0))
>>   )
>>   val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered,
>> 2))
>>
>>   val sparseData_not_ordered = Seq(
>> Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>> Vectors.sparse(3, Array(2,1,0), Array(5.0,4.0,3.0)),
>> Vectors.sparse(3, Array(0,1,2), Array(6.0, 7.0, 8.0)),
>> Vectors.sparse(3, Array(2,0), Array(1.0,9.0))
>>   )
>>  val sparseMat_not_ordered = new
>> RowMatrix(sc.parallelize(sparseData_not_ordered, 2))
>>
>> // Apparently, sparseMat_ordered and sparseMat_not_ordered are essentially
>> // the same matrix.
>> // However, the computeSVD results of these two matrices are different. Users
>> // should be notified about this situation.
>>   println(sparseMat_ordered.computeSVD(2,
>> true).U.rows.collect.mkString("\n"))
>>   println("===")
>>   println(sparseMat_not_ordered.computeSVD(2,
>> true).U.rows.collect.mkString("\n"))
>> ==
>> The results are:
>> ordered:
>> [-0.10972870132786407,-0.18850811494220537]
>> [-0.44712472003608356,-0.24828866611663725]
>> [-0.784520738744303,-0.3080692172910691]
>> [-0.4154110101064339,0.8988385762953358]
>>
>> not ordered:
>> [-0.10830447119599484,-0.1559341848984378]
>> [-0.4522713511277327,-0.23449829541447448]
>> [-0.7962382310594706,-0.3130624059305111]
>> [-0.43131320303494614,0.8453864703362308]
>>
>> Looking into this issue, I can see its cause is in RowMatrix.scala (line
>> 629). The sparse dspr implementation there requires ordered indices, because
>> it scans the indices consecutively to skip empty columns.
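>>
>> As a caller-side workaround, the (index, value) pairs can be sorted before
>> the vector is built -- a minimal sketch, reusing the imports and data from
>> the example above:
>>
>>   def sortedSparse(size: Int, indices: Array[Int], values: Array[Double]) = {
>>     // Reorder the two parallel arrays together so the indices are ascending.
>>     val pairs = indices.zip(values).sortBy(_._1)
>>     Vectors.sparse(size, pairs.map(_._1), pairs.map(_._2))
>>   }
>>
>>   val sparseData_reordered = sparseData_not_ordered.map {
>>     case v: SparseVector => sortedSparse(v.size, v.indices, v.values)
>>     case v => v
>>   }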
>>
>>
>>
>> -
>> Feel the sparking Spark!
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Indices-of-SparseVector-must-be-ordered-while-computing-SVD-tp11731.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
Hi TD,

Some observations:

1. If I submit the application using the spark-submit tool with client as
deploy mode, it works fine with a single master and worker (driver, master
and worker are running on the same machine).
2. If I submit the application using the spark-submit tool with client as
deploy mode, it crashes after some time with a StackOverflowError with a
single master and 2 workers (driver, master and 1 worker are running on the
same machine, the other worker is on a different machine):

15/04/23 05:42:04 Executor: Exception in task 0.0 in stage 23153.0 (TID 5412)
java.lang.StackOverflowError
        at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2864)
        at java.io.ObjectInputStream.readUTF(ObjectInputStream.java:1072)
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:671)
        at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:830)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)

3. If I submit the application

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
Yep no problem, but I can't seem to find the coalesce function in
pyspark.sql.{*, functions, types or whatever :) }

Olivier.

On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> a UDF might be a good idea, no?
>
> On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna* API
>> in PySpark, is there any efficient alternative to mapping the records
>> myself ?
>>
>> Regards,
>>
>> Olivier.
>>
>


Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution
process in Spark, I've overhauled the guide to contributing. Anyone
who is going to contribute needs to read it, as it has more formal
guidance about the process:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

We may push back harder now on pull requests and JIRAs that don't
follow this guidance. It will help everyone spend less time getting
changes in, and less time on duplicated effort or on changes that
won't make it in.

A summary of key points is found in CONTRIBUTING.md, a prompt
presented before opening pull requests
(https://github.com/apache/spark/blob/master/CONTRIBUTING.md):

- Is the change important and ready enough to ask the community to
spend time reviewing?
- Have you searched for existing, related JIRAs and pull requests?
- Is this a new feature that can stand alone as a package on
http://spark-packages.org ?
- Is the change being proposed clearly explained and motivated?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Joe,

Do you want a PR per branch (one for master, one for 1.3)? Are you still
maintaining 1.2? Do you need a Jira ticket per PR or can I submit them all
under the same ticket?

Or should I just submit it to master and let you guys back-port it?

Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11759.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Only against master; it can be cherry-picked to other branches.

On Thu, Apr 23, 2015 at 10:53 AM, jimfcarroll  wrote:
> Hi Joe,
>
> Do you want a PR per branch (one for master, one for 1.3)? Are you still
> maintaining 1.2? Do you need a Jira ticket per PR or can I submit them all
> under the same ticket?
>
> Or should I just submit it to master and let you guys back-port it?
>
> Jim
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11759.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Sean and Joe,

I have another question. 

GradientBoostedTrees.run iterates over the RDD, calling DecisionTree.run on
each iteration with a new random sample from the input RDD. DecisionTree.run
calls RandomForest.run, which also calls persist.

One of these seems superfluous.

Should I simply remove the persist call at the GradientBoostedTrees level?

Thanks
Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11762.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Those are different RDDs that DecisionTree persists, though. It's not redundant.
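
For context, the pattern in question -- persist the per-iteration sample,
train on it, and unpersist it when that iteration is done -- looks roughly
like this sketch (a general illustration, not the actual MLlib code; input
and train are placeholders):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  def runIteration[T](input: RDD[T], fraction: Double)(train: RDD[T] => Unit): Unit = {
    // Cache the sample used by this boosting iteration...
    val sampled = input.sample(false, fraction).persist(StorageLevel.MEMORY_AND_DISK)
    try {
      train(sampled)
    } finally {
      // ...and make sure it does not stay cached once the iteration is over.
      sampled.unpersist(blocking = false)
    }
  }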

On Thu, Apr 23, 2015 at 11:12 AM, jimfcarroll  wrote:
> Hi Sean and Joe,
>
> I have another question.
>
> GradientBoostedTrees.run iterates over the RDD, calling DecisionTree.run on
> each iteration with a new random sample from the input RDD. DecisionTree.run
> calls RandomForest.run, which also calls persist.
>
> One of these seems superfluous.
>
> Should I simply remove the persist call at the GradientBoostedTrees level?
>
> Thanks
> Jim
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11762.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
Ah damn. We need to add it to the Python list. Would you like to give it a
shot?


On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

>> Yep no problem, but I can't seem to find the coalesce function in
>> pyspark.sql.{*, functions, types or whatever :) }
>>
>> Olivier.
>>
>> On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> > a UDF might be a good idea, no?
>> >
>> > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
>> > o.girar...@lateral-thoughts.com> wrote:
> >
> >> Hi everyone,
> >> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna* API
> >> in PySpark, is there any efficient alternative to mapping the records
> >> myself ?
> >>
> >> Regards,
> >>
> >> Olivier.
> >>
> >
>


Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Mingyu Kim
Hi Reynold,

You mentioned that the new API allows arbitrary code to be run on the
driver side, but it's not very clear to me how this is different from what
Hadoop API provides. In your example of using broadcast, did you mean
broadcasting something in InputSource.getPartitions() and having
InputPartitions use the broadcast variables? Isn't that already possible
with Hadoop's InputFormat.getSplits()?

Thanks,
Mingyu





On 4/21/15, 4:33 PM, "Soren Macbeth"  wrote:

>I'm also super interested in this. Flambo (our clojure DSL) wraps the java
>api and it would be great to have this.
>
>On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin  wrote:
>
>> It can reuse. That's a good point and we should document it in the API
>> contract.
>>
>>
>> On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <
>> punya.bis...@gmail.com>
>> wrote:
>>
>> > Reynold, thanks for this! At Palantir we're heavy users of the Java
>>APIs
>> > and appreciate being able to stop hacking around with fake ClassTags
>>:)
>> >
>> > Regarding this specific proposal, is the contract of RecordReader#get
>> > intended to be that it returns a fresh object each time? Or is it
>>allowed
>> > to mutate a fixed object and return a pointer to it each time?
>> >
>> > Put another way, is a caller supposed to clone the output of get() if
>> they
>> > want to use it later?
>> >
>> > Punya
>> >
>> > On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin 
>>wrote:
>> >
>> >> I created a pull request last night for a new InputSource API that is
>> >> essentially a stripped down version of the RDD API for providing data
>> into
>> >> Spark. Would be great to hear the community's feedback.
>> >>
>> >> Spark currently has two de facto input source API:
>> >> 1. RDD
>> >> 2. Hadoop MapReduce InputFormat
>> >>
>> >> Neither of the above is ideal:
>> >>
>> >> 1. RDD: It is hard for Java developers to implement RDD, given the
>> >> implicit
>> >> class tags. In addition, the RDD API depends on Scala's runtime
>>library,
>> >> which does not preserve binary compatibility across Scala versions.
>>If a
>> >> developer chooses Java to implement an input source, it would be
>>great
>> if
>> >> that input source can be binary compatible in years to come.
>> >>
>> >> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly
>>restrictive.
>> >> For example, it forces key-value semantics, and does not support
>>running
>> >> arbitrary code on the driver side (an example of why this is useful
>>is
>> >> broadcast). In addition, it is somewhat awkward to tell developers
>>that
>> in
>> >> order to implement an input source for Spark, they should learn the
>> Hadoop
>> >> MapReduce API first.
>> >>
>> >>
>> >> My patch creates a new InputSource interface, described by:
>> >>
>> >> - an array of InputPartition that specifies the data partitioning
>> >> - a RecordReader that specifies how data on each partition can be
>>read
>> >>
>> >> This interface is similar to Hadoop's InputFormat, except that there
>>is
>> no
>> >> explicit key/value separation.
>> >>
>> >>
>> >> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
>> >> Pull request: https://github.com/apache/spark/pull/5603
>> >>
>> >
>>


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Reynold Xin
In the ctor of InputSource (I'm also considering adding an explicit
initialize call), the implementation of InputSource can execute arbitrary
code. The state in it will also be serialized and passed onto the executors.

Yes - technically you can hijack getSplits in Hadoop InputFormat to do the
same thing, and then put a reference of the state into every Split. But
that's kind of awkward. Hadoop relies on the giant Configuration object to
pass state over.
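
In rough terms, the shape described in this thread looks something like the
sketch below. This is a hypothetical simplification for illustration only --
the actual interface is in the pull request linked earlier:

  // Hypothetical sketch; names and signatures are illustrative, not the real API.
  trait InputPartition extends Serializable

  trait RecordReader[T] extends Serializable {
    def initialize(partition: InputPartition): Unit
    def hasNext: Boolean
    def get(): T        // per the discussion above, the returned object may be reused
    def close(): Unit
  }

  // The constructor (or an explicit initialize call) can run arbitrary
  // driver-side code, e.g. create a broadcast variable; the instance and its
  // state are then serialized and shipped to the executors.
  trait InputSource[T] extends Serializable {
    def getPartitions(): Array[InputPartition]
    def createRecordReader(partition: InputPartition): RecordReader[T]
  }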



On Thu, Apr 23, 2015 at 11:02 AM, Mingyu Kim  wrote:

> Hi Reynold,
>
> You mentioned that the new API allows arbitrary code to be run on the
> driver side, but it's not very clear to me how this is different from what
> Hadoop API provides. In your example of using broadcast, did you mean
> broadcasting something in InputSource.getPartitions() and having
> InputPartitions use the broadcast variables? Isn't that already possible
> with Hadoop's InputFormat.getSplits()?
>
> Thanks,
> Mingyu
>
>
>
>
>
> On 4/21/15, 4:33 PM, "Soren Macbeth"  wrote:
>
> >I'm also super interested in this. Flambo (our clojure DSL) wraps the java
> >api and it would be great to have this.
> >
> >On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin  wrote:
> >
> >> It can reuse. That's a good point and we should document it in the API
> >> contract.
> >>
> >>
> >> On Tue, Apr 21, 2015 at 4:06 PM, Punyashloka Biswal <
> >> punya.bis...@gmail.com>
> >> wrote:
> >>
> >> > Reynold, thanks for this! At Palantir we're heavy users of the Java
> >>APIs
> >> > and appreciate being able to stop hacking around with fake ClassTags
> >>:)
> >> >
> >> > Regarding this specific proposal, is the contract of RecordReader#get
> >> > intended to be that it returns a fresh object each time? Or is it
> >>allowed
> >> > to mutate a fixed object and return a pointer to it each time?
> >> >
> >> > Put another way, is a caller supposed to clone the output of get() if
> >> they
> >> > want to use it later?
> >> >
> >> > Punya
> >> >
> >> > On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin 
> >>wrote:
> >> >
> >> >> I created a pull request last night for a new InputSource API that is
> >> >> essentially a stripped down version of the RDD API for providing data
> >> into
> >> >> Spark. Would be great to hear the community's feedback.
> >> >>
> >> >> Spark currently has two de facto input source API:
> >> >> 1. RDD
> >> >> 2. Hadoop MapReduce InputFormat
> >> >>
> >> >> Neither of the above is ideal:
> >> >>
> >> >> 1. RDD: It is hard for Java developers to implement RDD, given the
> >> >> implicit
> >> >> class tags. In addition, the RDD API depends on Scala's runtime
> >>library,
> >> >> which does not preserve binary compatibility across Scala versions.
> >>If a
> >> >> developer chooses Java to implement an input source, it would be
> >>great
> >> if
> >> >> that input source can be binary compatible in years to come.
> >> >>
> >> >> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly
> >>restrictive.
> >> >> For example, it forces key-value semantics, and does not support
> >>running
> >> >> arbitrary code on the driver side (an example of why this is useful
> >>is
> >> >> broadcast). In addition, it is somewhat awkward to tell developers
> >>that
> >> in
> >> >> order to implement an input source for Spark, they should learn the
> >> Hadoop
> >> >> MapReduce API first.
> >> >>
> >> >>
> >> >> My patch creates a new InputSource interface, described by:
> >> >>
> >> >> - an array of InputPartition that specifies the data partitioning
> >> >> - a RecordReader that specifies how data on each partition can be
> >>read
> >> >>
> >> >> This interface is similar to Hadoop's InputFormat, except that there
> >>is
> >> no
> >> >> explicit key/value separation.
> >> >>
> >> >>
> >> >> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
> >> >> Pull request: https://github.com/apache/spark/pull/5603
> >> >>
> >> >
> >>
>
>


Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll

Okay.

PR: https://github.com/apache/spark/pull/5669

Jira: https://issues.apache.org/jira/browse/SPARK-7100

Hope that helps.

Let me know if you need anything else.

Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11767.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
yep :) I'll open the jira when I've got the time.
Thanks

On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:

> Ah damn. We need to add it to the Python list. Would you like to give it a
> shot?
>
>
> On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Yep no problem, but I can't seem to find the coalesce function in
>> pyspark.sql.{*, functions, types or whatever :) }
>>
>> Olivier.
>>
>> On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> > a UDF might be a good idea, no?
>> >
>> > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
>> > o.girar...@lateral-thoughts.com> wrote:
>> >
>> >> Hi everyone,
>> >> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
>> API
>> >> in PySpark, is there any efficient alternative to mapping the records
>> >> myself ?
>> >>
>> >> Regards,
>> >>
>> >> Olivier.
>> >>
>> >
>>
>
>


Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
What is the way of testing/building the pyspark part of Spark ?

On Thu, Apr 23, 2015 at 22:06, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> yep :) I'll open the jira when I've got the time.
> Thanks
>
> On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:
>
>> Ah damn. We need to add it to the Python list. Would you like to give it
>> a shot?
>>
>>
>> On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Yep no problem, but I can't seem to find the coalesce function in
>>> pyspark.sql.{*, functions, types or whatever :) }
>>>
>>> Olivier.
>>>
>>> On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>> > a UDF might be a good idea, no?
>>> >
>>> > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
>>> > o.girar...@lateral-thoughts.com> wrote:
>>> >
>>> >> Hi everyone,
>>> >> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
>>> API
>>> >> in PySpark, is there any efficient alternative to mapping the records
>>> >> myself ?
>>> >>
>>> >> Regards,
>>> >>
>>> >> Olivier.
>>> >>
>>> >
>>>
>>
>>


Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
You need to first have the Spark assembly jar built with "sbt/sbt
assembly/assembly"

Then usually I go into python/run-tests and comment out the non-SQL tests:

#run_core_tests
run_sql_tests
#run_mllib_tests
#run_ml_tests
#run_streaming_tests

And then you can run "python/run-tests"




On Thu, Apr 23, 2015 at 1:17 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> What is the way of testing/building the pyspark part of Spark ?
>
> On Thu, Apr 23, 2015 at 22:06, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> yep :) I'll open the jira when I've got the time.
>> Thanks
>>
>>> On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:
>>
>>> Ah damn. We need to add it to the Python list. Would you like to give it
>>> a shot?
>>>
>>>
>>> On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
 Yep no problem, but I can't seem to find the coalesce function in
 pyspark.sql.{*, functions, types or whatever :) }

 Olivier.

 On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
 o.girar...@lateral-thoughts.com> wrote:

 > a UDF might be a good idea, no?
 >
 > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
 > o.girar...@lateral-thoughts.com> wrote:
 >
 >> Hi everyone,
 >> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
 API
 >> in PySpark, is there any efficient alternative to mapping the records
 >> myself ?
 >>
 >> Regards,
 >>
 >> Olivier.
 >>
 >

>>>
>>>


Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I found another way setting a SPARK_HOME on a released version and
launching an ipython to load the contexts.
I may need your insight however, I found why it hasn't been done at the
same time, this method (like some others) uses a varargs in Scala and for
now the way functions are called only one parameter is supported.

So at first I tried to just generalise the helper function "_" in the
functions.py file to multiple arguments, but py4j's handling of varargs
forces me to create an Array[Column] if the target method is expecting
varargs.

But from Python's perspective, we have no idea of whether the target method
will be expecting varargs or just multiple arguments (to un-tuple).
I can create a special case for "coalesce" or "for method that takes of
list of columns as arguments" considering they will be varargs based (and
therefore needs an Array[Column] instead of just a list of arguments)

But this seems very specific and very prone to future mistakes.
Is there any way in Py4j to know before calling it the signature of a
method ?
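
To make the varargs point concrete, here is a toy Scala stand-in (not the
actual functions.coalesce definition) showing why a non-Scala caller such as
py4j ends up having to build the whole argument collection up front:

  object VarargsDemo {
    // A varargs method like coalesce(cols: Column*) compiles down to a method
    // taking a single Seq parameter.
    def firstNonNull(values: String*): String = values.find(_ != null).orNull

    def main(args: Array[String]): Unit = {
      // A Scala caller can pass the arguments directly...
      println(firstNonNull(null, "a", "b"))   // prints "a"
      // ...or expand an existing collection explicitly with : _*
      val cols = Seq[String](null, "x")
      println(firstNonNull(cols: _*))         // prints "x"
      // A caller outside Scala only sees the single Seq parameter, so it has
      // to construct that collection (e.g. an array of Columns) itself.
    }
  }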


On Thu, Apr 23, 2015 at 22:17, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> What is the way of testing/building the pyspark part of Spark ?
>
> On Thu, Apr 23, 2015 at 22:06, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> yep :) I'll open the jira when I've got the time.
>> Thanks
>>
>>> On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:
>>
>>> Ah damn. We need to add it to the Python list. Would you like to give it
>>> a shot?
>>>
>>>
>>> On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
 Yep no problem, but I can't seem to find the coalesce function in
 pyspark.sql.{*, functions, types or whatever :) }

 Olivier.

 On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
 o.girar...@lateral-thoughts.com> wrote:

 > a UDF might be a good idea, no?
 >
 > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
 > o.girar...@lateral-thoughts.com> wrote:
 >
 >> Hi everyone,
 >> let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
 API
 >> in PySpark, is there any efficient alternative to mapping the records
 >> myself ?
 >>
 >> Regards,
 >>
 >> Olivier.
 >>
 >

>>>
>>>


Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Joseph Bradley
I saw the PR already, but only saw this just now.  I think both persists
are useful based on my experience, but it's very hard to say in general.

On Thu, Apr 23, 2015 at 12:22 PM, jimfcarroll  wrote:

>
> Okay.
>
> PR: https://github.com/apache/spark/pull/5669
>
> Jira: https://issues.apache.org/jira/browse/SPARK-7100
>
> Hope that helps.
>
> Let me know if you need anything else.
>
> Jim
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-tp11750p11767.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
You can do it similar to the way countDistinct is done, can't you?

https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78



On Thu, Apr 23, 2015 at 1:59 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I found another way setting a SPARK_HOME on a released version and
> launching an ipython to load the contexts.
> I may need your insight however, I found why it hasn't been done at the
> same time, this method (like some others) uses a varargs in Scala and for
> now the way functions are called only one parameter is supported.
>
> So at first I tried to just generalise the helper function "_" in the
> functions.py file to multiple arguments, but py4j's handling of varargs
> forces me to create an Array[Column] if the target method is expecting
> varargs.
>
> But from Python's perspective, we have no idea of whether the target
> method will be expecting varargs or just multiple arguments (to un-tuple).
> I can create a special case for "coalesce" or "for method that takes of
> list of columns as arguments" considering they will be varargs based (and
> therefore needs an Array[Column] instead of just a list of arguments)
>
> But this seems very specific and very prone to future mistakes.
> Is there any way in Py4j to know before calling it the signature of a
> method ?
>
>
> On Thu, Apr 23, 2015 at 22:17, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> What is the way of testing/building the pyspark part of Spark ?
>>
>> On Thu, Apr 23, 2015 at 22:06, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> yep :) I'll open the jira when I've got the time.
>>> Thanks
>>>
>>> On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:
>>>
 Ah damn. We need to add it to the Python list. Would you like to give
 it a shot?


 On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
 o.girar...@lateral-thoughts.com> wrote:

> Yep no problem, but I can't seem to find the coalesce function in
> pyspark.sql.{*, functions, types or whatever :) }
>
> Olivier.
>
> On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
> > a UDF might be a good idea, no?
> >
> > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
> > o.girar...@lateral-thoughts.com> wrote:
> >
> >> Hi everyone,
> >> let's assume I'm stuck in 1.3.0, how can I benefit from the
> *fillna* API
> >> in PySpark, is there any efficient alternative to mapping the
> records
> >> myself ?
> >>
> >> Regards,
> >>
> >> Olivier.
> >>
> >
>




RE: Should we let everyone set Assignee?

2015-04-23 Thread Ulanov, Alexander
My thinking is that current way of assigning a contributor after the patch is 
done (or almost done) is OK. Parallel efforts are also OK until they are 
discussed in the issue's thread. Ilya Ganelin made a good point that it is 
about moving the project forward. It also adds means of competition "who make 
it faster/better" which is also good for the project and community. My only 
concern is about the throughput of Databricks folks who monitor issues, check 
patches and assign a contributor. Monitoring should be done on a constant basis 
(weekly?).

Best regards, Alexander

-Original Message-
From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] 
Sent: Wednesday, April 22, 2015 2:59 PM
To: Patrick Wendell
Cc: Nicholas Chammas; Ganelin, Ilya; Mark Hamstra; Sean Owen; dev
Subject: Re: Should we let everyone set Assignee?


Last one for the day.

Everyone, as I said clearly, I was "not alluding to anything fishy in 
practice", I was describing how things go wrong in such an environment. Sandy's 
email lays down some of these problems.

Assigning a JIRA in other projects is not a reservation. It is a clear 
intention of working on design or code.

You don't need a new convention of signaling. In almost all other projects, it 
is assigning tickets - that's how it is used.

+Vinod

On Apr 22, 2015, at 2:37 PM, Patrick Wendell  wrote:

> Sandy - I definitely agree with that. We should have a convention of 
> signaling someone intends to work - for instance by commenting on the 
> JIRA and we should document this on the contribution guide. The nice 
> thing about having that convention is that multiple people can say 
> they are going to work on something, whereas only one person can be 
> given the assignee slot on a JIRA.
> 
> 
> On Wed, Apr 22, 2015 at 2:33 PM, Nicholas Chammas 
>  wrote:
>> To repeat what Patrick said (literally):
>> 
>> If an issue is "assigned" to person X, but some other person Y 
>> submits a great patch for it, I think we have some obligation to 
>> Spark users and to the community to merge the better patch. So the 
>> idea of reserving the right to add a feature, it just seems overall off to 
>> me.
>> 
>> No-one in the Spark community dictates who gets to do work. When an 
>> issue is assigned to someone in JIRA, it's either because a) they did 
>> the work and the issue is now resolved, or b) they are signaling to 
>> others that they are working on it.
>> 
>> In the case of b), nothing stops other people from working on the 
>> issue and it's quite normal for other people to complete issues that 
>> were technically assigned to someone else. There is no land grabbing 
>> or stalling. Anyone who has contributed to Spark for any amount of time 
>> knows this.
>> 
>> Vinod,
>> 
>> I want to take this opportunity to call out the approach to 
>> communication you took here.
>> 
>> As a random contributor to Spark and active participant on this list, 
>> my reaction when I read your email was this:
>> 
>> You do not know how the Spark community actually works.
>> You read a thread that contains some trigger phrases.
>> You wrote a lengthy response as a knee-jerk reaction.
>> 
>> I'm not trying to mock, but I want to be direct and honest about how 
>> you came off in this thread to me and probably many others.
>> 
>> Why not ask questions first--many questions? Why not make doubly sure 
>> that you understand the situation correctly before responding?
>> 
>> In many ways this is much like filing a bug report. "I'm seeing this. 
>> It seems wrong to me. Is this expected?" I think we all know from 
>> experience that this kind of bug report is polite and will likely 
>> lead to a productive discussion. On the other hand: "You're returning 
>> a -1 here? This is obviously wrong! And, boy, lemme tell you how 
>> wrong you are!!!" No-one likes to deal with bug reports like this. 
>> More importantly, they get in the way of fixing the actual problem, if there 
>> is one.
>> 
>> This is not about the Apache Way or not. It's about basic etiquette 
>> and effective communication.
>> 
>> I understand that there are legitimate potential concerns here, and 
>> it's important that, as an Apache project, Spark work according to 
>> Apache principles. But when some person who has never participated on 
>> this list pops up out of nowhere with a lengthy lecture on the Apache 
>> Way and whatnot, I have to say that that is not an effective way to 
>> communicate. Pretty much the same thing happened with Greg Stein on 
>> an earlier thread some months ago about designating maintainers for 
>> components.
>> 
>> The concerns are legitimate, I'm sure, and we want to keep Spark in 
>> line with the Apache Way. And certainly, there have been many times 
>> when a project veered off course and needed to corrected.
>> 
>> But when we want to make things right, I hope we can do it in a way 
>> that respectfully and tactfully engages the community. These 
>> "lectures delivered from above" -- which 

RE: Should we let everyone set Assignee?

2015-04-23 Thread Sean Owen
The merge script automatically updates the linked JIRA after merging the PR
(which is why it is important to put the JIRA in the title). It can't auto-assign
the JIRA since usernames don't match up, but it is an easy reminder to set
the Assignee. I do it right after, and I think other committers do too.

I'll search later for Fixed and Unassigned JIRAs in case there are any.
Feel free to flag any.

In practice I think it is pretty rare that 2 people accidentally work on one
JIRA, and I can't remember a case where there was disagreement about how to
proceed. So I don't think a 'lock' is necessary in practice, and I don't
think even signaling has been a problem.
On Apr 23, 2015 6:14 PM, "Ulanov, Alexander" 
wrote:

> My thinking is that current way of assigning a contributor after the patch
> is done (or almost done) is OK. Parallel efforts are also OK until they are
> discussed in the issue's thread. Ilya Ganelin made a good point that it is
> about moving the project forward. It also adds means of competition "who
> make it faster/better" which is also good for the project and community. My
> only concern is about the throughput of Databricks folks who monitor
> issues, check patches and assign a contributor. Monitoring should be done
> on a constant basis (weekly?).
>
> Best regards, Alexander
>
> -Original Message-
> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
> Sent: Wednesday, April 22, 2015 2:59 PM
> To: Patrick Wendell
> Cc: Nicholas Chammas; Ganelin, Ilya; Mark Hamstra; Sean Owen; dev
> Subject: Re: Should we let everyone set Assignee?
>
>
> Last one for the day.
>
> Everyone, as I said clearly, I was "not alluding to anything fishy in
> practice", I was describing how things go wrong in such an environment.
> Sandy's email lays down some of these problems.
>
> Assigning a JIRA in other projects is not a reservation. It is a clear
> intention of working on design or code.
>
> You don't need a new convention of signaling. In almost all other
> projects, it is assigning tickets - that's how it is used.
>
> +Vinod
>
> On Apr 22, 2015, at 2:37 PM, Patrick Wendell  wrote:
>
> > Sandy - I definitely agree with that. We should have a convention of
> > signaling someone intends to work - for instance by commenting on the
> > JIRA and we should document this on the contribution guide. The nice
> > thing about having that convention is that multiple people can say
> > they are going to work on something, whereas only one person can be
> > given the assignee slot on a JIRA.
> >
> >
> > On Wed, Apr 22, 2015 at 2:33 PM, Nicholas Chammas
> >  wrote:
> >> To repeat what Patrick said (literally):
> >>
> >> If an issue is "assigned" to person X, but some other person Y
> >> submits a great patch for it, I think we have some obligation to
> >> Spark users and to the community to merge the better patch. So the
> >> idea of reserving the right to add a feature, it just seems overall off
> to me.
> >>
> >> No-one in the Spark community dictates who gets to do work. When an
> >> issue is assigned to someone in JIRA, it's either because a) they did
> >> the work and the issue is now resolved, or b) they are signaling to
> >> others that they are working on it.
> >>
> >> In the case of b), nothing stops other people from working on the
> >> issue and it's quite normal for other people to complete issues that
> >> were technically assigned to someone else. There is no land grabbing
> >> or stalling. Anyone who has contributed to Spark for any amount of time
> knows this.
> >>
> >> Vinod,
> >>
> >> I want to take this opportunity to call out the approach to
> >> communication you took here.
> >>
> >> As a random contributor to Spark and active participant on this list,
> >> my reaction when I read your email was this:
> >>
> >> You do not know how the Spark community actually works.
> >> You read a thread that contains some trigger phrases.
> >> You wrote a lengthy response as a knee-jerk reaction.
> >>
> >> I'm not trying to mock, but I want to be direct and honest about how
> >> you came off in this thread to me and probably many others.
> >>
> >> Why not ask questions first--many questions? Why not make doubly sure
> >> that you understand the situation correctly before responding?
> >>
> >> In many ways this is much like filing a bug report. "I'm seeing this.
> >> It seems wrong to me. Is this expected?" I think we all know from
> >> experience that this kind of bug report is polite and will likely
> >> lead to a productive discussion. On the other hand: "You're returning
> >> a -1 here? This is obviously wrong! And, boy, lemme tell you how
> >> wrong you are!!!" No-one likes to deal with bug reports like this.
> >> More importantly, they get in the way of fixing the actual problem, if
> there is one.
> >>
> >> This is not about the Apache Way or not. It's about basic etiquette
> >> and effective communication.
> >>
> >> I understand that there are legitimate potential concerns here, and
> >> it's important that, as

Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Sean Owen
Following my comment earlier that "I think we set Assignee for Fixed
JIRAs consistently", I found there are actually 880 counter examples.
Lots of them are old, and I'll try to fix as many that are recent (for
the 1.4.0 release credits) as I can stand to click through.

Let's set Assignee after resolving consistently though. In various
ways I've heard that people do really like the bit of credit, and I
don't think anybody disputed setting Assignee *after* it was resolved
as a way of giving credit.

People who know they're missing a credit are welcome to ping me
directly to get it fixed.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Shivaram Venkataraman
A related question that has affected me in the past: If we get a PR from a
new developer I sometimes find that I am not able to assign an issue to
them after merging the PR. Is there a process we need follow to get new
contributors on to a particular group in JIRA ? Or does it somehow happen
automatically ?

Thanks
Shivaram

On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen  wrote:

> Following my comment earlier that "I think we set Assignee for Fixed
> JIRAs consistently", I found there are actually 880 counter examples.
> Lots of them are old, and I'll try to fix as many that are recent (for
> the 1.4.0 release credits) as I can stand to click through.
>
> Let's set Assignee after resolving consistently though. In various
> ways I've heard that people do really like the bit of credit, and I
> don't think anybody disputed setting Assignee *after* it was resolved
> as a way of giving credit.
>
> People who know they're missing a credit are welcome to ping me
> directly to get it fixed.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Hari Shreedharan
You’d need to add them as a contributor in the JIRA admin page. Once you do 
that, you should be able to assign the jira to that person




Thanks, Hari

On Thu, Apr 23, 2015 at 5:33 PM, Shivaram Venkataraman
 wrote:

> A related question that has affected me in the past: If we get a PR from a
> new developer I sometimes find that I am not able to assign an issue to
> them after merging the PR. Is there a process we need follow to get new
> contributors on to a particular group in JIRA ? Or does it somehow happen
> automatically ?
> Thanks
> Shivaram
> On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen  wrote:
>> Following my comment earlier that "I think we set Assignee for Fixed
>> JIRAs consistently", I found there are actually 880 counter examples.
>> Lots of them are old, and I'll try to fix as many that are recent (for
>> the 1.4.0 release credits) as I can stand to click through.
>>
>> Let's set Assignee after resolving consistently though. In various
>> ways I've heard that people do really like the bit of credit, and I
>> don't think anybody disputed setting Assignee *after* it was resolved
>> as a way of giving credit.
>>
>> People who know they're missing a credit are welcome to ping me
>> directly to get it fixed.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:47 PM, Hari Shreedharan  wrote:

> You’d need to add them as a contributor in the JIRA admin page. Once you
> do that, you should be able to assign the jira to that person
>
>
>
Is this documented, and does every PMC (or committer) have access to do
that ?


>
>
> Thanks, Hari
>
> On Thu, Apr 23, 2015 at 5:33 PM, Shivaram Venkataraman
>  wrote:
>
> > A related question that has affected me in the past: If we get a PR from
> a
> > new developer I sometimes find that I am not able to assign an issue to
> > them after merging the PR. Is there a process we need follow to get new
> > contributors on to a particular group in JIRA ? Or does it somehow happen
> > automatically ?
> > Thanks
> > Shivaram
> > On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen  wrote:
> >> Following my comment earlier that "I think we set Assignee for Fixed
> >> JIRAs consistently", I found there are actually 880 counter examples.
> >> Lots of them are old, and I'll try to fix as many that are recent (for
> >> the 1.4.0 release credits) as I can stand to click through.
> >>
> >> Let's set Assignee after resolving consistently though. In various
> >> ways I've heard that people do really like the bit of credit, and I
> >> don't think anybody disputed setting Assignee *after* it was resolved
> >> as a way of giving credit.
> >>
> >> People who know they're missing a credit are welcome to ping me
> >> directly to get it fixed.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >>
>



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen  wrote:

> Following my comment earlier that "I think we set Assignee for Fixed
> JIRAs consistently", I found there are actually 880 counter examples.
> Lots of them are old, and I'll try to fix as many that are recent (for
> the 1.4.0 release credits) as I can stand to click through.
>
> Let's set Assignee after resolving consistently though. In various
> ways I've heard that people do really like the bit of credit, and I
> don't think anybody disputed setting Assignee *after* it was resolved
> as a way of giving credit.
>
> People who know they're missing a credit are welcome to ping me
> directly to get it fixed.
>
>
>
+1, this will help with giving people the "bit of credit", but I guess it
also makes it easier to recognize community contributors on their way to
becoming committers.



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Sean Owen
Permission to change project roles is restricted to admins / PMC,
naturally. I don't know if it's documented beyond this that Gavin
helpfully pasted:
https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme

Another option is to make the list of people who you can assign to
include "Anyone". You'd get all of Apache in the list, so I assume
that's why it isn't the default, but might be the easier thing at this
point in Spark. I had to add 6 new people just now while assigning ~40
JIRAs.

On Thu, Apr 23, 2015 at 9:08 PM, Luciano Resende  wrote:
>
> On Thu, Apr 23, 2015 at 5:47 PM, Hari Shreedharan
>  wrote:
>>
>> You’d need to add them as a contributor in the JIRA admin page. Once you
>> do that, you should be able to assign the jira to that person
>>
>>
>
> Is this documented, and does every PMC (or committer) have access to do that
> ?
>
>>
>>
>>
>> Thanks, Hari
>>
>> On Thu, Apr 23, 2015 at 5:33 PM, Shivaram Venkataraman
>>  wrote:
>>
>> > A related question that has affected me in the past: If we get a PR from
>> > a
>> > new developer I sometimes find that I am not able to assign an issue to
>> > them after merging the PR. Is there a process we need follow to get new
>> > contributors on to a particular group in JIRA ? Or does it somehow
>> > happen
>> > automatically ?
>> > Thanks
>> > Shivaram
>> > On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen  wrote:
>> >> Following my comment earlier that "I think we set Assignee for Fixed
>> >> JIRAs consistently", I found there are actually 880 counter examples.
>> >> Lots of them are old, and I'll try to fix as many that are recent (for
>> >> the 1.4.0 release credits) as I can stand to click through.
>> >>
>> >> Let's set Assignee after resolving consistently though. In various
>> >> ways I've heard that people do really like the bit of credit, and I
>> >> don't think anybody disputed setting Assignee *after* it was resolved
>> >> as a way of giving credit.
>> >>
>> >> People who know they're missing a credit are welcome to ping me
>> >> directly to get it fixed.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >>
>
>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



First-class support for pip/virtualenv in pyspark

2015-04-23 Thread Justin Uang
Hi,

I have been trying to figure out how to ship a python package that I have
been working on, and this has brought up a couple questions to me. Please
note that I'm fairly new to python package management, so any
feedback/corrections is welcome =)

It looks like the --py-files support we have merely adds the .py, .zip, or
.egg to the sys.path, and therefore only supports "built" distributions
that just need to be added to the path. Because of this, it looks like
wheels won't work either, since they involve an installation process (
https://www.python.org/dev/peps/pep-0427/#is-it-possible-to-import-python-code-directly-from-a-wheel-file
).

In addition, any type of distribution that has shared libraries, such as
pandas and numpy wheels will fail because "ZIP import of dynamic modules
(.pyd, .so) is disallowed" (https://docs.python.org/2/library/zipimport.html
).
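
To make the path-based mechanism above concrete, here is a minimal sketch of
how a pure-Python dependency gets shipped today; "deps.zip" and "mymodule" are
made-up names for the example.

    # Sketch of the existing mechanism: a zip of pure-Python modules is
    # distributed to the executors and added to sys.path there.
    # "deps.zip" and "mymodule" are hypothetical names for this example.
    from pyspark import SparkContext

    sc = SparkContext(appName="py-files-demo")
    sc.addPyFile("deps.zip")  # same idea as spark-submit --py-files deps.zip

    def use_dep(x):
        import mymodule  # importable on executors because deps.zip is on sys.path
        return mymodule.transform(x)

    print(sc.parallelize(range(10)).map(use_dep).collect())

This only works because everything inside the archive can be imported straight
off sys.path via zipimport, which is exactly what breaks down for wheels and
for packages with compiled extensions.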

The only way to support wheels or other types of source distributions that
require an "installation" step is to use an installer like pip, in which
case the natural extension is to use virtualenv. Have we considered having
pyspark manage virtualenvs and use pip install to install packages that
are sent across the cluster? I feel like first-class support for pip
install will

- allow us to ship packages that require an install step (numpy, pandas,
etc)
- help users not have to provision the cluster with all the dependencies
- allow multiple applications to run with different environments at the same
time
- allow a user just to specify a top level dependency or requirements.txt,
and have pip install all the transitive dependencies automatically

Thanks!

Justin
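
For what it's worth, the kind of hand-rolled workaround people use today, and
roughly what first-class support could automate, looks like the sketch below.
This is only an illustration under stated assumptions: virtualenv and pip are
assumed to be installed on every worker, /tmp/pyspark_venv is an arbitrary
scratch path, and the pinned packages are just examples.

    # Hand-rolled sketch (not an existing PySpark feature): lazily create a
    # virtualenv on each executor, pip-install the dependencies once, then
    # make its site-packages importable in the worker process.
    import os
    import site
    import subprocess

    VENV_DIR = "/tmp/pyspark_venv"                      # assumed scratch location
    REQUIREMENTS = ["numpy==1.9.2", "pandas==0.16.0"]   # example pins

    def ensure_virtualenv():
        if not os.path.exists(os.path.join(VENV_DIR, "bin", "pip")):
            subprocess.check_call(["virtualenv", VENV_DIR])
            subprocess.check_call(
                [os.path.join(VENV_DIR, "bin", "pip"), "install"] + REQUIREMENTS)
        # Assumes a Python 2.7 virtualenv layout on the workers.
        site.addsitedir(os.path.join(VENV_DIR, "lib", "python2.7", "site-packages"))

    def process_partition(records):
        ensure_virtualenv()
        import pandas as pd      # now resolvable, including its .so extensions
        return [len(pd.DataFrame(list(records)))]

    # Driver side, given an existing SparkContext `sc`:
    # sc.parallelize(range(100), 8).mapPartitions(process_partition).collect()

A real implementation would also need locking across concurrently launched
worker processes and per-application isolation, which is part of why built-in
support would help.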


Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
*bump*

On Thu, Apr 23, 2015 at 3:46 PM, Sourav Chandra <
sourav.chan...@livestream.com> wrote:

> Hi TD,
>
> Some observations:
>
> 1. If I submit the application using the spark-submit tool with *client as
> deploy mode*, it works fine with a single master and worker (driver, master
> and worker are running on the same machine)
> 2. If I submit the application using the spark-submit tool with client as
> deploy mode, it *crashes after some time with a StackOverflowError* with a
> *single master and 2 workers* (driver, master and 1 worker running on the
> same machine, the other worker on a different machine)
> 15/04/23 05:42:04 Executor: Exception in task 0.0 in stage 23153.0 (TID 5412)
> java.lang.StackOverflowError
> at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2864)
> at java.io.ObjectInputStream.readUTF(ObjectInputStream.java:1072)
> at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:671)
> at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:830)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInp
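
For context on the API involved, below is a minimal PySpark sketch of the
usual updateStateByKey setup. updateStateByKey requires a checkpoint
directory, and periodic checkpointing of the state DStream is also what keeps
the lineage from growing without bound, which is often the culprit when task
deserialization overflows the stack on deeply nested lists like the trace
above. Host, port and checkpoint path are placeholders.

    # Generic stateful word count, for illustration only (not a diagnosis of
    # the report above). Placeholders: "localhost", 9999, and the HDFS path.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stateful-wordcount")
    ssc = StreamingContext(sc, 10)                       # 10-second batches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  # required for state

    def update(new_values, running_count):
        return sum(new_values) + (running_count or 0)

    counts = (ssc.socketTextStream("localhost", 9999)
                 .flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .updateStateByKey(update))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()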

Contributing Documentation Changes

2015-04-23 Thread madhu phatak
Hi,
 As I was reading the Contributing to Spark wiki, I noticed it mentions that we
can contribute external links to Spark tutorials. I have written many of them
on my blog. It would be great if someone could add them to the Spark website.



Regards,
Madhukara Phatak
http://datamantra.io/


Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I'll try thanks

On Fri, Apr 24, 2015 at 00:09, Reynold Xin  wrote:

> You can do it similar to the way countDistinct is done, can't you?
>
>
> https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78
>
>
>
> On Thu, Apr 23, 2015 at 1:59 PM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> I found another way: setting SPARK_HOME to a released version and
>> launching an IPython shell to load the contexts.
>> I may need your insight, however. I found why it hasn't been done at the
>> same time: this method (like some others) takes varargs in Scala, and for
>> now the way functions are called only supports a single parameter.
>>
>> So at first I tried to just generalise the helper function "_" in the
>> functions.py file to multiple arguments, but py4j's handling of varargs
>> forces me to create an Array[Column] if the target method is expecting
>> varargs.
>>
>> But from Python's perspective, we have no idea whether the target
>> method will be expecting varargs or just multiple arguments (to un-tuple).
>> I can create a special case for "coalesce" or for "methods that take a
>> list of columns as arguments", considering they will be varargs based (and
>> therefore need an Array[Column] instead of just a list of arguments).
>>
>> But this seems very specific and very prone to future mistakes.
>> Is there any way in Py4j to know the signature of a method before
>> calling it?
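
For reference, a rough sketch of the Array[Column] approach described above,
modelled on how the Scala varargs functions are wrapped for Python. It assumes
the Scala side exposes functions.coalesce(Column*) and leans on PySpark/Py4J
internals (sc._jvm, sc._gateway, Column._jc) that may change between versions.

    # Illustrative sketch only: wrap a Scala-varargs SQL function for Python
    # by building an explicit Java Array[Column], which is what Py4J needs
    # when the target method is varargs-based.
    from pyspark import SparkContext
    from pyspark.sql import Column

    def coalesce(*cols):
        sc = SparkContext._active_spark_context
        # Py4J cannot guess that the Scala side expects varargs, so build the
        # Java array explicitly.
        jcols = sc._gateway.new_array(sc._jvm.org.apache.spark.sql.Column,
                                      len(cols))
        for i, col in enumerate(cols):
            jcols[i] = col._jc               # unwrap the underlying Java Column
        return Column(sc._jvm.org.apache.spark.sql.functions.coalesce(jcols))

Detecting varargs automatically is the hard part: Py4J only sees positional
arguments, so without an explicit per-function registry (or some reflection on
the Scala side) each wrapper has to know this about its target method.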
>>
>>
>> On Thu, Apr 23, 2015 at 22:17, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> What is the way of testing/building the pyspark part of Spark ?
>>>
>>> On Thu, Apr 23, 2015 at 22:06, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
 yep :) I'll open the jira when I've got the time.
 Thanks

 On Thu, Apr 23, 2015 at 19:31, Reynold Xin  wrote:

> Ah damn. We need to add it to the Python list. Would you like to give
> it a shot?
>
>
> On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Yep no problem, but I can't seem to find the coalesce function in
>> pyspark.sql.{*, functions, types or whatever :) }
>>
>> Olivier.
>>
>> On Mon, Apr 20, 2015 at 11:48, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> > a UDF might be a good idea, no?
>> >
>> > On Mon, Apr 20, 2015 at 11:17, Olivier Girardot <
>> > o.girar...@lateral-thoughts.com> wrote:
>> >
>> >> Hi everyone,
>> >> let's assume I'm stuck on 1.3.0: how can I benefit from the *fillna*
>> >> API in PySpark? Is there any efficient alternative to mapping the
>> >> records myself?
>> >>
>> >> Regards,
>> >>
>> >> Olivier.
>> >>
>> >
>>
>
>
>
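
Picking up the original question in this thread, the UDF workaround mentioned
above can look roughly like the sketch below for 1.3.0, where fillna is not
yet exposed in Python. Column names, fill values and the sample data are
arbitrary examples, and an existing SparkContext `sc` is assumed.

    # Hedged sketch of a fillna-like workaround: replace nulls in one column
    # with a Python UDF instead of the (missing) DataFrame.fillna API.
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    sqlContext = SQLContext(sc)      # assumes an existing SparkContext `sc`
    df = sqlContext.createDataFrame([(2, 5), (1, None)], ["id", "score"])

    fill_zero = udf(lambda v: 0 if v is None else v, IntegerType())
    filled = df.select("id", fill_zero(df["score"]).alias("score"))
    filled.show()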