Re: Spark Improvement Proposals
Another week, another ping. Anyone on the PMC willing to call a vote on this?

On Mon, Feb 27, 2017 at 3:08 PM, Ryan Blue wrote:
> I'd like to see more discussion on the issues I raised. I don't think there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I don't think that it justifies failing proposals. I think it boiled down to "shepherds can be helpful", which isn't a good reason to require them in my opinion. Sam also had some good comments on this and I think that there's more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of talking about it. If most people are okay with it as it stands and want a vote, I'm fine testing this out and fixing it later.
>
> rb
>
> On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley wrote:
>> The current draft LGTM. I agree some of the various concerns may need to be addressed in the future, depending on how SPIPs progress in practice. If others agree, let's put it to a vote and revisit the proposal in a few months.
>> Joseph
>>
>> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger wrote:
>>> It's been a week since any further discussion.
>>>
>>> Do PMC members think the current draft is OK to vote on?
>>>
>>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan wrote:
>>> > I like the document and am happy to see the SPIP draft version; however, I feel the shepherd role is again a hurdle in the process improvement. It's like everything depends only on the shepherd.
>>> >
>>> > I also want to add the point that an SPIP should be time-bound with a defined SLA, or else it will defeat its purpose.
>>> >
>>> > Regards,
>>> > Vaquar khan
>>> >
>>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue wrote:
>>> >> > [The shepherd] can advise on technical and procedural considerations for people outside the community
>>> >>
>>> >> The sentiment is good, but this doesn't justify requiring a shepherd for a proposal. There are plenty of people that wouldn't need this, would get feedback during discussion, or would ask a committer or PMC member if it weren't a formal requirement.
>>> >>
>>> >> > if no one is willing to be a shepherd, the proposed idea is probably not going to receive much traction in the first place.
>>> >>
>>> >> This also doesn't sound like a reason for needing a shepherd. Saying that a shepherd probably won't hurt the process doesn't give me an idea of why a shepherd should be required in the first place.
>>> >>
>>> >> What was the motivation for adding a shepherd originally? It may not be bad and it could be helpful, but neither of those makes me think that they should be required or else the proposal fails.
>>> >>
>>> >> rb
>>> >>
>>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <timhun...@databricks.com> wrote:
>>> >>> The doc looks good to me.
>>> >>>
>>> >>> Ryan, the role of the shepherd is to make sure that someone knowledgeable with Spark processes is involved: this person can advise on technical and procedural considerations for people outside the community. Also, if no one is willing to be a shepherd, the proposed idea is probably not going to receive much traction in the first place.
>>> >>>
>>> >>> Tim
>>> >>>
>>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger wrote:
>>> >>> > Reynold, thanks, LGTM.
>>> >>> >
>>> >>> > Sean, great concerns. I agree that behavior is largely cultural and writing down a process won't necessarily solve any problems one way or the other. But one outwardly visible change I'm hoping for out of this is a way for people who have a stake in Spark, but can't follow JIRAs closely, to go to the Spark website, see the list of proposed major changes, contribute discussion on issues that are relevant to their needs, and see a clear direction once a vote has passed. We don't have that now.
>>> >>> >
>>> >>> > Ryan, realistically speaking any PMC member can and will stop any changes they don't like anyway, so might as well be up front about the reality of the situation.
>>> >>> >
>>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen wrote:
>>> >>> >> The text seems fine to me. Really, this is not describing a fundamentally new process, which is good. We've always had JIRAs, we've always been able to call a VOTE for a big question. This just writes down a sensible set of guidelines for putting those two together when a major change is proposed. I look forward to turning some big JIRAs into a request for a SPIP.
>>> >>> >>
>>> >>> >> My only hesitation is that this seems to be perceived by some as a new
Re: Spark Improvement Proposals
Do we need a VOTE? Heck, I think anyone can call one, anyway.

Pre-flight vote check: anyone have objections to the text as-is? See
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

If so, let's hash out specific suggested changes. If not, then I think the next step is probably to update the github.com/apache/spark-website repo with the text here. That's a code/doc change we can just review and merge as usual.

On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger wrote:
> Another week, another ping. Anyone on the PMC willing to call a vote on this?
Issues: Generate JSON with null values in Spark 2.0.x
Hello Dev / Users,

I am working on migrating PySpark code to Scala. In Python, iterating over Spark data with a dictionary and generating JSON with null values is possible with json.dumps(), and the result can be converted to a SparkSQL Row. But in Scala, how can we generate JSON with null values from a DataFrame?

Thanks.
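A minimal Scala sketch of one possible approach, assuming the json4s library that Spark already bundles; the helper name toJsonWithNulls and the example DataFrame are illustrative assumptions, not code from this thread. The idea is to render each Row yourself, emitting an explicit JNull, since writing JSON through the DataFrame writer drops null fields:

import org.apache.spark.sql.{Dataset, Encoders, Row}
import org.json4s.JsonAST.{JNull, JObject, JString, JValue}
import org.json4s.jackson.JsonMethods.{compact, render}

// Sketch: render each Row as a JSON string, keeping explicit nulls.
// For simplicity, all non-null values are rendered as strings here;
// a real implementation would map each column type to the matching JValue.
def toJsonWithNulls(df: Dataset[Row]): Dataset[String] = {
  val fieldNames = df.schema.fieldNames
  df.map { row =>
    val fields: List[(String, JValue)] = fieldNames.zipWithIndex.toList.map {
      case (name, i) =>
        val value: JValue =
          if (row.isNullAt(i)) JNull else JString(row.get(i).toString)
        name -> value
    }
    compact(render(JObject(fields)))
  }(Encoders.STRING)
}

// Hypothetical usage:
// import spark.implicits._
// val df = Seq(("a", Some(1)), ("b", None: Option[Int])).toDF("name", "value")
// toJsonWithNulls(df).show(false)   // the None row renders as {"name":"b","value":null}

The resulting Dataset[String] can then be written out as text, preserving the null fields.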
[SQL] Analysis failed when combining Window function and GROUP BY in Spark 2.x
We can reproduce this using the following code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
val sql1 =
  """
    |create temporary view tb as select * from values
    |(1, 0),
    |(1, 0),
    |(2, 0)
    |as grouping(a, b)
  """.stripMargin
val sql =
  """
    |select count(distinct(b)) over (partition by a) from tb group by a
  """.stripMargin
spark.sql(sql1)
spark.sql(sql).show()

It throws an exception like this:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'tb.`b`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Project [count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
+- Project [b#1, a#0, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
   +- Window [count(distinct b#1) windowspecdefinition(a#0, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L], [a#0]
      +- Aggregate [a#0], [b#1, a#0]
         +- SubqueryAlias tb
            +- Project [a#0, b#1]
               +- SubqueryAlias grouping
                  +- LocalRelation [a#0, b#1]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:220)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:247)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)

But there is no exception in Spark 1.6.x. I think the sql

select count(distinct(b)) over (partition by a) from tb group by a

should be executed. I have no idea about the exception. Is this in line with expectations? Any help is appreciated!

Best,
Stan
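For reference, a minimal sketch of a possible rewrite that does pass analysis in Spark 2.x, assuming the intent is a distinct count of b per a (an assumption about the intended semantics, not something stated in the report). It de-duplicates (a, b) in a subquery and then counts over the window, which sidesteps both the GROUP BY check and distinct aggregates inside the window:

// Hypothetical rewrite, not from the original report: count distinct b per a
// by de-duplicating (a, b) first, then counting over the window partition.
// Note this returns one row per distinct (a, b) pair rather than per group.
val rewritten =
  """
    |select a, count(b) over (partition by a) as distinct_b_count
    |from (select distinct a, b from tb) t
  """.stripMargin
spark.sql(rewritten).show()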