Re: Spark Improvement Proposals

2017-03-07 Thread Cody Koeninger
Another week, another ping.  Anyone on the PMC willing to call a vote on
this?

On Mon, Feb 27, 2017 at 3:08 PM, Ryan Blue  wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
> On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley 
> wrote:
>
>> The current draft LGTM.  I agree some of the various concerns may need to
>> be addressed in the future, depending on how SPIPs progress in practice.
>> If others agree, let's put it to a vote and revisit the proposal in a few
>> months.
>> Joseph
>>
>> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger 
>> wrote:
>>
>>> It's been a week since any further discussion.
>>>
>>> Do PMC members think the current draft is OK to vote on?
>>>
>>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan 
>>> wrote:
>>> > I like the document and am happy to see the SPIP draft version;
>>> > however, I feel the shepherd role is again a hurdle in the process
>>> > improvement. It's like everything depends only on the shepherd.
>>> >
>>> > I also want to add that an SPIP should be time-bound with a defined
>>> > SLA, else it will defeat the purpose.
>>> >
>>> >
>>> > Regards,
>>> > Vaquar khan
>>> >
>>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue 
>>> > wrote:
>>> >>
>>> >> > [The shepherd] can advise on technical and procedural
>>> >> > considerations for people outside the community
>>> >>
>>> >> The sentiment is good, but this doesn't justify requiring a shepherd
>>> >> for a proposal. There are plenty of people that wouldn't need this,
>>> >> would get feedback during discussion, or would ask a committer or PMC
>>> >> member if it weren't a formal requirement.
>>> >>
>>> >> > if no one is willing to be a shepherd, the proposed idea is
>>> >> > probably not going to receive much traction in the first place.
>>> >>
>>> >> This also doesn't sound like a reason for needing a shepherd. Saying
>>> >> that a shepherd probably won't hurt the process doesn't give me an
>>> >> idea of why a shepherd should be required in the first place.
>>> >>
>>> >> What was the motivation for adding a shepherd originally? It may not
>>> >> be bad and it could be helpful, but neither of those makes me think
>>> >> that they should be required or else the proposal fails.
>>> >>
>>> >> rb
>>> >>
>>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <
>>> timhun...@databricks.com>
>>> >> wrote:
>>> >>>
>>> >>> The doc looks good to me.
>>> >>>
>>> >>> Ryan, the role of the shepherd is to make sure that someone
>>> >>> knowledgeable with Spark processes is involved: this person can
>>> >>> advise on technical and procedural considerations for people outside the
>>> >>> community. Also, if no one is willing to be a shepherd, the proposed
>>> >>> idea is probably not going to receive much traction in the first
>>> >>> place.
>>> >>>
>>> >>> Tim
>>> >>>
>>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger 
>>> >>> wrote:
>>> >>> > Reynold, thanks, LGTM.
>>> >>> >
>>> >>> > Sean, great concerns.  I agree that behavior is largely cultural
>>> >>> > and writing down a process won't necessarily solve any problems
>>> >>> > one way or the other.  But one outwardly visible change I'm hoping
>>> >>> > for out of this is a way for people who have a stake in Spark, but
>>> >>> > can't follow jiras closely, to go to the Spark website, see the
>>> >>> > list of proposed major changes, contribute discussion on issues
>>> >>> > that are relevant to their needs, and see a clear direction once a
>>> >>> > vote has passed.  We don't have that now.
>>> >>> >
>>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>>> >>> > changes they don't like anyway, so might as well be up front about
>>> >>> > the reality of the situation.
>>> >>> >
>>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen 
>>> >>> > wrote:
>>> >>> >> The text seems fine to me. Really, this is not describing a
>>> >>> >> fundamentally new process, which is good. We've always had JIRAs,
>>> >>> >> we've always been able to call a VOTE for a big question. This
>>> >>> >> just writes down a sensible set of guidelines for putting those
>>> >>> >> two together when a major change is proposed. I look forward to
>>> >>> >> turning some big JIRAs into a request for a SPIP.
>>> >>> >>
>>> >>> >> My only hesitation is that this seems to be perceived by some as
>>> >>> >> a new

Re: Spark Improvement Proposals

2017-03-07 Thread Sean Owen
Do we need a VOTE? Heck, I think anyone can call one, anyway.

Pre-flight vote check: anyone have objections to the text as-is?
See
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

If so, let's hash out specific suggested changes.

If not, then I think the next step is to probably update the
github.com/apache/spark-website repo with the text here. That's a code/doc
change we can just review and merge as usual.

On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger  wrote:

> Another week, another ping.  Anyone on the PMC willing to call a vote on
> this?
>


Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users,

I am migrating PySpark code to Scala. In Python, iterating over a
dictionary and generating JSON with null values is possible with
json.dumps(), which can then be converted to a SparkSQL Row. In Scala, how
can we generate JSON with null values as a DataFrame?

Thanks.
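A minimal Scala sketch of one way to keep explicit nulls (the case class,
column names, and sample rows below are assumptions for illustration, not
from the original question). Option fields become nullable columns, and
because Spark's built-in JSON writer drops null-valued fields, the sketch
renders the JSON strings by hand:

import org.apache.spark.sql.SparkSession

// Illustrative case class and data; Option[Int] maps to a nullable column,
// so the null survives into the DataFrame.
case class Record(name: String, score: Option[Int])

val spark = SparkSession.builder().appName("json-nulls").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(Record("a", Some(1)), Record("b", None)).toDF()

// df.write.json(...) omits null-valued fields by default, so to keep explicit
// nulls one workaround is to build the JSON text manually as a Dataset[String].
val jsonStrings = df.map { row =>
  val score = if (row.isNullAt(1)) "null" else row.getInt(1).toString
  s"""{"name":"${row.getString(0)}","score":$score}"""
}
jsonStrings.show(truncate = false)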


[SQL]Analysis failed when combining Window function and GROUP BY in Spark2.x

2017-03-07 Thread StanZhai
We can reproduce this using the following code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("test").master("local").getOrCreate()

val sql1 =
  """
    |create temporary view tb as select * from values
    |(1, 0),
    |(1, 0),
    |(2, 0)
    |as grouping(a, b)
  """.stripMargin

val sql =
  """
    |select count(distinct(b)) over (partition by a) from tb group by a
  """.stripMargin

spark.sql(sql1)
spark.sql(sql).show()

It will throw an exception like this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
'tb.`b`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which 
value you get.;;
Project [count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
+- Project [b#1, a#0, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
   +- Window [count(distinct b#1) windowspecdefinition(a#0, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L], [a#0]
      +- Aggregate [a#0], [b#1, a#0]
         +- SubqueryAlias tb
            +- Project [a#0, b#1]
               +- SubqueryAlias grouping
                  +- LocalRelation [a#0, b#1]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:220)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:247)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)

But there is no exception in Spark 1.6.x. I think the SQL
select count(distinct(b)) over (partition by a) from tb group by a
should execute. I have no idea what causes the exception. Is this in line with
expectations?
Any help is appreciated!
Best, 
Stan
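
For reference, a workaround sketch under the same temporary view tb
(illustrative only, not a confirmed fix): compute the distinct count with a
plain aggregation and join it back, instead of combining a DISTINCT aggregate
window with GROUP BY:

val workaround =
  """
    |select t.a, d.cnt
    |from tb t
    |join (select a, count(distinct b) as cnt from tb group by a) d
    |on t.a = d.a
  """.stripMargin

// Runs without the analysis error because no window function is involved.
spark.sql(workaround).show()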






