Re: [PR] [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper [spark]
beliefer commented on PR #50046: URL: https://github.com/apache/spark/pull/50046#issuecomment-2676739969 Merged into branch-4.0/master @LuciferYang Thank you!
Re: [PR] [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest [spark]
beliefer closed pull request #50047: [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest URL: https://github.com/apache/spark/pull/50047
Re: [PR] [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites [spark]
LuciferYang closed pull request #50049: [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites URL: https://github.com/apache/spark/pull/50049
Re: [PR] [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites [spark]
LuciferYang commented on PR #50049: URL: https://github.com/apache/spark/pull/50049#issuecomment-2676746744 Merged into master. Thanks @dongjoon-hyun and @beliefer
Re: [PR] [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest [spark]
beliefer commented on PR #50047: URL: https://github.com/apache/spark/pull/50047#issuecomment-2676747067 Merged into branch-4.0/master @LuciferYang Thank you!
[PR] [SPARK-51298][WIP] Support variant in CSV scan [spark]
chenhao-db opened a new pull request, #50052:
URL: https://github.com/apache/spark/pull/50052

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.
[PR] [DRAFT] Resolve default string exprs [spark]
stefankandic opened a new pull request, #50053:
URL: https://github.com/apache/spark/pull/50053

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966931924 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), + Map.empty, + isInternal = true, + context = context +) +(condStmt, body) + } + + override def getTreeIterator: Iterator[CompoundStatementExec] = treeIterator + + override def reset(): Unit = { +state = CaseState.Condition Review Comment: You're right, added it. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/SqlScriptingLogicalPlans.scala: ## @@ -226,7 +226,7 @@ case class IterateStatement(label: String) extends CompoundPlanStatement { * in WHEN branches. * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. */ -case class CaseStatement( +case class SearchedCaseStatement( Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to
[PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon opened a new pull request, #50054:
URL: https://github.com/apache/spark/pull/50054

### What changes were proposed in this pull request?
This PR fixes the installation page for PySpark to clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst.

### Why are the changes needed?
To clarify spark.remote and spark.master in pyspark-connect and pyspark-client.

### Does this PR introduce _any_ user-facing change?
No, doc-only change.

### How was this patch tested?
CI build in this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [DRAFT] Resolve default string exprs [spark]
HyukjinKwon commented on code in PR #50053:
URL: https://github.com/apache/spark/pull/50053#discussion_r1966967731

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala: ##
@@ -28,6 +29,10 @@ import org.apache.spark.sql.types.{DataType, StringType}
 * collation from the corresponding object (table/view -> schema -> catalog).
 */
object ResolveDDLCommandStringTypes extends Rule[LogicalPlan] {
+  // Tag to mark expressions that have been cast to a new type so that we can
+  // avoid infinite recursion when resolving the same expression multiple times.
+  private val CAST_ADDED_TAG = new TreeNodeTag[Unit]("defaultStringExpressionCastAdded")

Review Comment: The tag is gone when the nodes are copied, IIRC, so I don't think this can be used for controlling the behaviour.
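For context on the concern above, here is a minimal, hypothetical sketch (not code from the PR) of how a `TreeNodeTag` is attached to and read from a plan node. The open question in the review is whether such a tag survives when a rule produces a copy of the node.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OneRowRelation}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

object TreeNodeTagSketch {
  // Hypothetical tag, named after the one in the PR diff above.
  private val CAST_ADDED_TAG = new TreeNodeTag[Unit]("defaultStringExpressionCastAdded")

  def main(args: Array[String]): Unit = {
    val plan: LogicalPlan = OneRowRelation()

    // Mark the node and read the mark back on the same instance.
    plan.setTagValue(CAST_ADDED_TAG, ())
    assert(plan.getTagValue(CAST_ADDED_TAG).isDefined)

    // The review concern: if an analyzer rule returns a *copied* node instead of
    // this instance, the copy may not carry the tag, so the tag alone may not be
    // a reliable guard against re-applying the rule.
  }
}
```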
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
HyukjinKwon commented on PR #50050: URL: https://github.com/apache/spark/pull/50050#issuecomment-2677269920 Merged to master and branch-4.0.
Re: [PR] [SPARK-51256][SQL] Increase parallelism if joining with small bucket table [spark]
wangyum commented on PR #50004:
URL: https://github.com/apache/spark/pull/50004#issuecomment-2677280562

```scala
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.range(1000).selectExpr("id", "id + 1 as new_id").write.saveAsTable("t1")
spark.range(10).selectExpr("id").write.bucketBy(1, "id").saveAsTable("t2")
spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("cost")
```

Spark 3.2:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#21L], [id#23L], Inner
   :- Sort [id#21L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#21L, 200), ENSURE_REQUIREMENTS, [plan_id=50]
   :     +- Filter isnotnull(id#21L)
   :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
   +- Sort [id#23L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#23L, 200), ENSURE_REQUIREMENTS, [plan_id=57]
         +- Filter isnotnull(id#23L)
            +- FileScan parquet default.t2[id#23L] Batched: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
```

After Spark 3.3:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#21L], [id#23L], Inner
   :- Sort [id#21L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#21L, 1), ENSURE_REQUIREMENTS, [plan_id=51]
   :     +- Filter isnotnull(id#21L)
   :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
   +- Sort [id#23L ASC NULLS FIRST], false, 0
      +- Filter isnotnull(id#23L)
         +- FileScan parquet default.t2[id#23L] Batched: true, Bucketed: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct, SelectedBucketsCount: 1 out of 1
```
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
HyukjinKwon closed pull request #50050: [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. URL: https://github.com/apache/spark/pull/50050
Re: [PR] [MINOR][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
llphxd commented on PR #50048:
URL: https://github.com/apache/spark/pull/50048#issuecomment-2676709230

> > I would love to, but I failed to register jira and have to wait 24 hours. So I'll create the issue later.
>
> Please tell me if the issue is ready.

The issue is ready: https://issues.apache.org/jira/browse/SPARK-51297
Re: [PR] [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper [spark]
beliefer closed pull request #50046: [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper URL: https://github.com/apache/spark/pull/50046
Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]
ahshahid commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1966860945 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3185,16 +3217,103 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti "Spark can only do this while using the new shuffle block fetching protocol")) } + test("SPARK-51272: retry all the succeeding stages when the map stage is indeterminate with" + +" concurrent tasks completion") { +if (scheduler != null) { + this.afterEach() +} + +val monitor = new Object() +val resubmitFailedStageReached = Array.fill[Boolean](1)(false) +this.dagSchedulerInterceptor = new DagSchedulerInterceptor { + override def interceptHandleTaskCompletion(event: CompletionEvent): Unit = { +event.reason match { + case Success if event.task.isInstanceOf[ResultTask[_, _]] => +assert(resubmitFailedStageReached(0)) +monitor.synchronized { + monitor.notify() +} + + case _ => +} + } + + override def interceptResubmitFailedStages(): Unit = { +monitor.synchronized { + resubmitFailedStageReached(0) = true + monitor.notify() + monitor.wait() +} + } +} + +this.beforeEach() + +val numPartitions = 2 +val (shuffleId1, shuffleId2) = constructTwoIndeterminateStage() +completeShuffleMapStageSuccessfully(shuffleId2, 0, numPartitions) +val resultStage = scheduler.stageIdToStage(2).asInstanceOf[ResultStage] +val activeJob = resultStage.activeJob +assert(activeJob.isDefined) +// The result stage is still waiting for its 2 tasks to complete +assert(resultStage.findMissingPartitions() == Seq.tabulate(numPartitions)(i => i)) +new Thread(() => { + runEventInCurrentThread( +makeCompletionEvent( + taskSets(2).tasks(0), + FetchFailed(makeBlockManagerId("hostA"), shuffleId1, 0L, 0, 0, "ignored"), + null)) +}).start() + +monitor.synchronized { + if (!resubmitFailedStageReached(0)) { +monitor.wait() + } +} +assert(resubmitFailedStageReached(0)) +new Thread(() => { + runEventInCurrentThread(makeCompletionEvent(taskSets(2).tasks(1), Success, 11)) +}).start() + +val shuffleStage1 = this.scheduler.shuffleIdToMapStage(shuffleId1) +val shuffleStage2 = this.scheduler.shuffleIdToMapStage(shuffleId2) +var keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = shuffleStage1.latestInfo.attemptNumber() != 1 +} +completeShuffleMapStageSuccessfully(0, 1, numPartitions) +keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = shuffleStage2.latestInfo.attemptNumber() != 1 +} + +completeShuffleMapStageSuccessfully(1, 1, numPartitions) +keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = resultStage.latestInfo.attemptNumber() != 1 +} + +assert(resultStage.latestInfo.numTasks == 2) + } + test("SPARK-25341: retry all the succeeding stages when the map stage is indeterminate") { val (shuffleId1, shuffleId2) = constructIndeterminateStageFetchFailed() // Check status for all failedStages val failedStages = scheduler.failedStages.toSeq assert(failedStages.map(_.id) == Seq(1, 2)) // Shuffle blocks of "hostC" is lost, so first task of the `shuffleMapRdd2` needs to retry. +// TODO: Asif THIS ASSERTION APPEARS TO BE WRONG. As the ShuffleMapStage is inDeterminate all Review Comment: done. 
## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4163,9 +4282,16 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti val failedStages = scheduler.failedStages.toSeq assert(failedStages.map(_.id) == Seq(1, 2)) // Shuffle blocks of "hostC" is lost, so first task of the `shuffleMapRdd2` needs to retry. +// TODO: Asif THIS ASSERTION APPEARS TO BE WRONG. As the ShuffleMapStage is inDeterminate all Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966944493 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: We have tests for these exceptions, they are in SqlScriptingInterpreterSuite. You're right it will screw up line numbers, perhaps it would be better to copy the origin from the original caseVariable expression. This will keep the proper line number, however would just point to one part of the equality expression. For example, if we have this script: ``` BEGIN CASE 1 WHEN NULL THEN SELECT 41; ELSE SELECT 43; END CASE; END ``` The error we would have if we kept the Origin from case variable expression: `{LINE:3} [BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement 1 is invalid. Expected single row with a value of the BOOLEAN type, but got an empty row. SQLSTATE: 21000` The error with the hacked origin: `[BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement (1 = NULL) is invalid.
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966944963 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: I guess we could expand the hack to copy the just line number from caseVariableExec? That way we would have both the line number and the proper equality expression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [DRAFT] Two string types [spark]
github-actions[bot] commented on PR #48861: URL: https://github.com/apache/spark/pull/48861#issuecomment-2677214294 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-50292] Add MapStatus RowCount optimize skewed job [spark]
github-actions[bot] closed pull request #48825: [SPARK-50292] Add MapStatus RowCount optimize skewed job URL: https://github.com/apache/spark/pull/48825
Re: [PR] [SPARK-50319] Reorder ResolveIdentifierClause and BindParameter rules [spark]
github-actions[bot] commented on PR #48849: URL: https://github.com/apache/spark/pull/48849#issuecomment-2677214302 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon closed pull request #50054: [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst URL: https://github.com/apache/spark/pull/50054
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on PR #50054: URL: https://github.com/apache/spark/pull/50054#issuecomment-2677426348 Merged to master and branch-4.0.
Re: [PR] [SPARK-51299][SQL][UI] MetricUtils.stringValue should filter metric values with initValue rather than a hardcoded value [spark]
jiwen624 commented on PR #50055: URL: https://github.com/apache/spark/pull/50055#issuecomment-2677443309 @dongjoon-hyun @cloud-fan could you let me know your thoughts about this fix? Thanks.
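For readers skimming the digest, the idea in the PR title can be illustrated with a small sketch. This is not the actual `MetricUtils` code; the names and output format below are hypothetical, and it only shows the core change: filter out unset accumulator values by comparing against the metric's declared `initValue` rather than a hardcoded sentinel.

```scala
// Hypothetical sketch of the idea behind SPARK-51299; not the real MetricUtils implementation.
def stringValueSketch(values: Seq[Long], initValue: Long): String = {
  // Drop accumulators that were never updated by comparing against the metric's own
  // initial value instead of assuming the unset value is a fixed sentinel such as -1.
  val validValues = values.filter(_ != initValue)
  if (validValues.isEmpty) "0"
  else {
    val sorted = validValues.sorted
    s"min: ${sorted.head}, med: ${sorted(sorted.size / 2)}, max: ${sorted.last}"
  }
}
```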
Re: [PR] [SPARK-51282][ML][PYTHON][CONNECT] Optimize OneVsRestModel transform by eliminating the JVM-Python data exchange [spark]
zhengruifeng commented on PR #50041: URL: https://github.com/apache/spark/pull/50041#issuecomment-2677458005 Let me convert it to draft for now, to make it easy to debug another issue https://issues.apache.org/jira/browse/SPARK-51118
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
beliefer commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1967061151

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -5545,6 +5545,15 @@
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION =
+    buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption")
+      .internal()
+      .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " +
+        "to the default table location.")

Review Comment: Could we fix the bug? It means users might already use the 'path' option.
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
beliefer commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1967072225

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -5545,6 +5545,15 @@
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION =
+    buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption")
+      .internal()
+      .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " +
+        "to the default table location.")

Review Comment: I got it. The bug has existed for about 3 years, so keeping it behind a legacy flag looks better.
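A usage sketch of the trade-off being discussed, assuming the flag lands as written in the diff above; `df` and the table/path names are hypothetical. Users who depend on the old behavior could opt back into it with the legacy conf.

```scala
// Sketch only: restore the pre-fix behavior via the legacy flag from the diff above.
spark.conf.set("spark.sql.legacy.dataFrameWriterV2IgnorePathOption", "true")
df.writeTo("t").using("json").option("path", "/tmp/t").create()  // 'path' is ignored again
```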
[PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]
itholic opened a new pull request, #50056:
URL: https://github.com/apache/spark/pull/50056

### What changes were proposed in this pull request?
This PR proposes to fix a broken link for `ps.sql`.

### Why are the changes needed?
There is a broken link in the official documentation for `ps.sql`:
https://github.com/user-attachments/assets/de70f1a9-35a2-4adb-80e4-d9726170344e

### Does this PR introduce _any_ user-facing change?
No API changes, but a user-facing documentation improvement.

### How was this patch tested?
Manually tested, and the existing doc build in CI should pass.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51206][PYTHON][CONNECT] Move Arrow conversion helpers out of Spark Connect [spark]
zhengruifeng commented on code in PR #49941: URL: https://github.com/apache/spark/pull/49941#discussion_r1967091437 ## python/pyspark/sql/conversion.py: ## @@ -0,0 +1,543 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +from pyspark.sql.pandas.utils import require_minimum_pyarrow_version + +# Arrow is required for arrow conversion. +require_minimum_pyarrow_version() + + +import array +import datetime +import decimal +from typing import Any, Callable, List, Sequence + +import pyarrow as pa +from pyspark.errors import PySparkValueError +from pyspark.sql.pandas.types import _dedup_names, _deduplicate_field_names, to_arrow_schema +from pyspark.sql.types import ( +ArrayType, +BinaryType, +DataType, +DecimalType, +MapType, +NullType, +Row, +StringType, +StructField, +StructType, +TimestampNTZType, +TimestampType, +UserDefinedType, +VariantType, +VariantVal, +_create_row, +) + + +class LocalDataToArrowConversion: +""" +Conversion from local data (except pandas DataFrame and numpy ndarray) to Arrow. +""" + +@staticmethod +def _need_converter( +dataType: DataType, +nullable: bool = True, +) -> bool: +if not nullable: +# always check the nullability +return True +elif isinstance(dataType, NullType): +# always check the nullability +return True +elif isinstance(dataType, StructType): +# Struct maybe rows, should convert to dict. +return True +elif isinstance(dataType, ArrayType): +return LocalDataToArrowConversion._need_converter( +dataType.elementType, dataType.containsNull +) +elif isinstance(dataType, MapType): +# Different from PySpark, here always needs conversion, +# since an Arrow Map requires a list of tuples. 
+return True +elif isinstance(dataType, BinaryType): +return True +elif isinstance(dataType, (TimestampType, TimestampNTZType)): +# Always truncate +return True +elif isinstance(dataType, DecimalType): +# Convert Decimal('NaN') to None +return True +elif isinstance(dataType, StringType): +# Coercion to StringType is allowed +return True +elif isinstance(dataType, UserDefinedType): +return True +elif isinstance(dataType, VariantType): +return True +else: +return False + +@staticmethod +def _create_converter( +dataType: DataType, +nullable: bool = True, +variants_as_dicts: bool = False, # some code paths may require python internal types +) -> Callable: +assert dataType is not None and isinstance(dataType, DataType) +assert isinstance(nullable, bool) + +if not LocalDataToArrowConversion._need_converter(dataType, nullable): +return lambda value: value + +if isinstance(dataType, NullType): + +def convert_null(value: Any) -> Any: +if value is not None: +raise PySparkValueError(f"input for {dataType} must be None, but got {value}") +return None + +return convert_null + +elif isinstance(dataType, StructType): +field_names = dataType.fieldNames() +dedup_field_names = _dedup_names(dataType.names) + +field_convs = [ +LocalDataToArrowConversion._create_converter( +field.dataType, field.nullable, variants_as_dicts +) +for field in dataType.fields +] + +def convert_struct(value: Any) -> Any: +if value is None: +if not nullable: +raise PySparkValueError(f"input for {dataType} must not be None") +return None +else: +assert isinstance(value, (tuple, dict)) or hasattr( +value, "__dict__" +), f"{type(value)} {value}" + +_dict = {} +
[PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 opened a new pull request, #50057:
URL: https://github.com/apache/spark/pull/50057

### What changes were proposed in this pull request?
Bump zstd-jni to the latest version.

### Why are the changes needed?
https://github.com/facebook/zstd/releases/tag/v1.5.7
https://github.com/luben/zstd-jni/releases/tag/v1.5.7-1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GHA. Will update benchmark soon.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 commented on PR #50057: URL: https://github.com/apache/spark/pull/50057#issuecomment-2677646689 cc @dongjoon-hyun @LuciferYang
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
davidm-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1967153704 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: If it's an easy fix (and it looks like it is) I would definitely go with this approach (copying the line number to the hacked origin). If not, I think it's better to keep the line number and then improve the messaging later on as a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
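A rough sketch of the change being agreed on in this thread, adapted from the snippet quoted above: keep the synthesized "(caseVar = expr)" text but carry over the line number from the case variable's origin. It assumes `Origin` exposes a `line` field and that `SingleStatementExec` keeps its `origin` accessible under these names; treat it as a sketch of the suggestion, not the final patch.

```scala
// Sketch (field/accessor names are assumptions, see the note above):
val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString
val condStmt = new SingleStatementExec(
  condition,
  Origin(
    line = caseVariableExec.origin.line,  // keep the original line number for error messages
    sqlText = Some(conditionText),
    startIndex = Some(0),
    stopIndex = Some(conditionText.length - 1)),
  Map.empty,
  isInternal = true,
  context = context)
```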
Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]
cloud-fan commented on code in PR #50026:
URL: https://github.com/apache/spark/pull/50026#discussion_r1967012051

## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ##
@@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap(
   def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler
 }

+object TriggerToExceptionHandlerMap {
+  def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap(

Review Comment: Let's be very cautious about any global state. Can we write a detailed comment to explain how this global state is used and cleaned up? Or try our best to not use this global state.
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
wangyum commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2677370449 @shrprasa Does it work after applying the patch?
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
dongjoon-hyun commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1967023059 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala: ## @@ -839,4 +839,26 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo condition = "CALL_ON_STREAMING_DATASET_UNSUPPORTED", parameters = Map("methodName" -> "`writeTo`")) } + + test("SPARK-51281: create/replace file source tables") { +Seq(true, false).foreach { ignorePath => + withSQLConf(SQLConf.LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION.key -> ignorePath.toString) { +withTable("t1", "t2") { + spark.range(10).writeTo("t1").using("json").create() + checkAnswer(spark.table("t1"), spark.range(10).toDF()) Review Comment: nit. `spark.range(10).toDF()` seems to be repeated three times. Shall we define one to reuse during this unit test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
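The refactor suggested above could look roughly like the following; the value name and its placement inside the test body are assumptions, not the final patch.

```scala
test("SPARK-51281: create/replace file source tables") {
  Seq(true, false).foreach { ignorePath =>
    withSQLConf(SQLConf.LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION.key -> ignorePath.toString) {
      withTable("t1", "t2") {
        // Build the expected result once instead of repeating spark.range(10).toDF().
        val expected = spark.range(10).toDF()
        spark.range(10).writeTo("t1").using("json").create()
        checkAnswer(spark.table("t1"), expected)
        // ...the remaining checks in the PR would reuse `expected` the same way.
      }
    }
  }
}
```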
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1966991532 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5545,6 +5545,15 @@ object SQLConf { .booleanConf .createWithDefault(false) + val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION = +buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption") + .internal() + .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " + +"to the default table location.") + .version("4.0.0") Review Comment: ```suggestion .version("3.5.5") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
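Independent of the version string, the new flag is an internal boolean conf, so opting back into the old behavior would presumably look like this (a sketch; the session setup is assumed):

```scala
// Restore the pre-fix behavior where DataFrameWriterV2 ignores the 'path' option
// and always writes to the default table location (the new default is false).
spark.conf.set("spark.sql.legacy.dataFrameWriterV2IgnorePathOption", "true")
```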
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967007721 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala: ## @@ -1407,13 +1409,62 @@ private[hive] object HiveClientImpl extends Logging { case _ => new HiveConf(conf, classOf[HiveConf]) } -try { +val hive = try { Hive.getWithoutRegisterFns(hiveConf) } catch { // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but - // 2.3.8 don't), therefore here we fallback when encountering the exception. + // 2.3.8 doesn't), therefore here we fallback when encountering the exception. case _: NoSuchMethodError => Hive.get(hiveConf) } +configureMaxThriftMessageSize(hiveConf, hive.getMSC) +hive + } + + // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633 + private def configureMaxThriftMessageSize( + hiveConf: HiveConf, msClient: IMetaStoreClient): Unit = try { +msClient match { + // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to support synchronization + // and retrying, we should unwrap and access the real MetaStoreClient instance firstly + case proxy if JdkProxy.isProxyClass(proxy.getClass) => +JdkProxy.getInvocationHandler(proxy) match { + case syncHandler if syncHandler.getClass.getName.endsWith("SynchronizedHandler") => +val realMscField = SparkClassUtils.classForName( Review Comment: @LuciferYang thanks for the review, I simplified the reflection call by following your suggestions. Regarding concerns about the reflection code itself, I tuned the code so that the added logic only takes effect when the user explicitly configures `hive.thrift.client.max.message.size`, in case users compile Spark with their own modified Hive version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
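The unwrapping step being discussed boils down to peeling JDK dynamic-proxy layers before touching the real `MetaStoreClient`. A generic sketch of that pattern follows; the delegate field name and the recursion are assumptions for illustration, not the exact Hive internals.

```scala
import java.lang.reflect.{InvocationHandler, Proxy}

// Generic sketch: unwrap a JDK dynamic proxy to reach the object it delegates to.
// The field name "client" is assumed; Hive's SynchronizedHandler and retrying
// handler keep their delegates in their own private fields.
def unwrapProxy(obj: AnyRef, delegateFieldName: String = "client"): AnyRef =
  if (Proxy.isProxyClass(obj.getClass)) {
    val handler: InvocationHandler = Proxy.getInvocationHandler(obj)
    val field = handler.getClass.getDeclaredField(delegateFieldName) // may throw if the name differs
    field.setAccessible(true)
    unwrapProxy(field.get(handler), delegateFieldName)
  } else {
    obj
  }
```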
Re: [PR] [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
beliefer commented on PR #50048: URL: https://github.com/apache/spark/pull/50048#issuecomment-2677392603 Merged into branch-4.0/master @llphxd @HyukjinKwon @yaooqinn Thank you all! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-51299][SQL][Web UI] Filter out invalid metric values with initValue [spark]
jiwen624 opened a new pull request, #50055: URL: https://github.com/apache/spark/pull/50055 ### What changes were proposed in this pull request? This PR proposes to use the `initValue` of a metric in `org.apache.spark.util.MetricUtils.stringValue` instead of a hardcoded init value to filter out invalid metric values. ### Why are the changes needed? In the method `org.apache.spark.util.MetricUtils.stringValue`, a hardcoded value of 0 is used to filter out invalid metric values for SIZE_METRIC, TIMING_METRIC and NS_TIMING_METRIC: `val validValues = values.filter(_ >= 0)` However, we offer methods to create these types of metrics with an initValue other than -1 (introduced in https://github.com/apache/spark/pull/41555): `def createSizeMetric(sc: SparkContext, name: String, initValue: Long = -1): SQLMetric = {` This means there is a chance that the metrics are created with an initValue != -1, and in this case the filter above will produce incorrect results. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
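The proposed direction, roughly: thread the metric's own `initValue` into the filter instead of hard-coding the -1/0 sentinel. The sketch below assumes the init value can be passed to the helper and that "valid" means strictly greater than the init value; the exact comparison is a guess, not the PR's final choice.

```scala
// Sketch: filter by the metric's declared initValue rather than assuming it is -1.
def validMetricValues(values: Seq[Long], initValue: Long): Seq[Long] =
  values.filter(_ > initValue)
```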
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967004950 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -81,14 +82,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat * Due to classloader isolation issues, pattern matching won't work here so we need * to compare the canonical names of the exceptions, which we assume to be stable. */ - private def isClientException(e: Throwable): Boolean = { -var temp: Class[_] = e.getClass -var found = false -while (temp != null && !found) { - found = clientExceptions.contains(temp.getCanonicalName) - temp = temp.getSuperclass -} -found + @tailrec + private def isClientException(e: Throwable): Boolean = e match { +case re: RuntimeException if re.getCause != null => + isClientException(re.getCause) Review Comment: To cover the `RuntimeException` thrown by `Hive.getMSC` ``` Cause: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1742) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3607) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3659) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3639) at org.apache.spark.sql.hive.client.HiveClientImpl$.getHive(HiveClientImpl.scala:1420) at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:269) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:236) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:235) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:285) at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:420) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:192) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.scala:17) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:192) ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
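Putting the removed and added hunks of the diff together, the updated helper presumably ends up along these lines (a reconstruction from the quoted fragments, not a verified copy of the PR; `clientExceptions` is the existing set in HiveExternalCatalog).

```scala
import scala.annotation.tailrec

@tailrec
private def isClientException(e: Throwable): Boolean = e match {
  // New case: unwrap RuntimeException wrappers such as the one thrown by Hive.getMSC,
  // so the wrapped metastore exception is still classified as a client-side error.
  case re: RuntimeException if re.getCause != null =>
    isClientException(re.getCause)
  case other =>
    // Original logic, kept as the base case: walk up the class hierarchy and compare
    // canonical names against the known client exception classes.
    var temp: Class[_] = other.getClass
    var found = false
    while (temp != null && !found) {
      found = clientExceptions.contains(temp.getCanonicalName)
      temp = temp.getSuperclass
    }
    found
}
```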
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
beliefer commented on PR #50050: URL: https://github.com/apache/spark/pull/50050#issuecomment-2677381793 @HyukjinKwon Thank you very much! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
beliefer closed pull request #50048: [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md URL: https://github.com/apache/spark/pull/50048 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]
itholic commented on code in PR #50038: URL: https://github.com/apache/spark/pull/50038#discussion_r1967030513 ## python/pyspark/logger/logger.py: ## @@ -66,10 +67,21 @@ def format(self, record: logging.LogRecord) -> str: } if record.exc_info: exc_type, exc_value, exc_tb = record.exc_info +stacktrace = traceback.extract_tb(exc_tb) + +structured_stacktrace = [ +{ +"class": exc_type.__name__ if exc_type else "UnknownException", Review Comment: Makes sense. Just applied the suggestion. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967033629 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: let me fix a little bit more -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967033498 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Actually I also meant local-cluster too :-).. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
dongjoon-hyun commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967034041 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
dongjoon-hyun commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967030929 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Does `local*` mean the `local[K,F]` syntax in this line, @HyukjinKwon? If so, it looks a little confusing to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org