Re: [Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-29 Thread Hyukjin Kwon
Thank you for letting us know.
On 30 Mar 2017 3:52 a.m., "Holden Karau" wrote:
> Hi PySpark Developers,
>
> In https://issues.apache.org/jira/browse/SPARK-19955 /
> https://github.com/apache/spark/pull/17355, as part of our continued
> Python 2.6 deprecation https://issues.apache.org/jira/browse/

Re: Why is shuffle data always persisted to disk?

2017-03-29 Thread Kay Ousterhout
There's been some discussion of this on this JIRA and the associated PR. The short summary is that, in theory / to a VERY rough approximation, the OS buffer cache does everything we'd want an in-memory shuffle to do, and is simple. On Wed, Mar 29,

[Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-29 Thread Holden Karau
Hi PySpark Developers, In https://issues.apache.org/jira/browse/SPARK-19955 / https://github.com/apache/spark/pull/17355, as part of our continued Python 2.6 deprecation https://issues.apache.org/jira/browse/SPARK-15902 & eventual removal https://issues.apache.org/jira/browse/SPARK-12661, Jenkins

Re: planning & discussion for larger scheduler changes

2017-03-29 Thread Imran Rashid
Thanks for the responses, all. I may have worded my original email poorly -- I don't want to focus too much on SPARK-14649 and SPARK-13669 in particular, but more on how we should be approaching these changes.
On Mon, Mar 27, 2017 at 9:01 PM, Kay Ousterhout wrote:
> (1) I'm pretty hesitant to me

Catalyst: unary or binary expressions that are not UnaryExpressions or BinaryExpressions? Why?

2017-03-29 Thread Jacek Laskowski
Hi, While reviewing the available expressions in Catalyst, I've come across a few places (AggregateExpression or WindowExpression) that are unary or binary expressions but inherit directly from Expression, which makes them slightly harder for me to understand (especially since I can't stop thinking about the reason

Why is shuffle data always persisted to disk?

2017-03-29 Thread Effi Ofer
Greetings, I was wondering why Spark's Shuffler always persists the shuffle data to disk? I understand that the persisted data can be used by the scheduler to truncate the lineage of the RDD graph if an existing RDD has been materialized as a side effect of an earlier shuffle. But that does not e