Re: [PR] [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper [spark]
beliefer commented on PR #50046: URL: https://github.com/apache/spark/pull/50046#issuecomment-2676739969 Merged into branch-4.0/master @LuciferYang Thank you!
Re: [PR] [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest [spark]
beliefer closed pull request #50047: [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest URL: https://github.com/apache/spark/pull/50047
Re: [PR] [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites [spark]
LuciferYang closed pull request #50049: [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites URL: https://github.com/apache/spark/pull/50049
Re: [PR] [SPARK-51293][CORE][SQL][SS][MLLIB][TESTS] Cleanup unused private functions from test suites [spark]
LuciferYang commented on PR #50049: URL: https://github.com/apache/spark/pull/50049#issuecomment-2676746744 Merged into master. Thanks @dongjoon-hyun and @beliefer
Re: [PR] [SPARK-51292][SQL] Remove unnecessary inheritance from PlanTestBase, ExpressionEvalHelper and PlanTest [spark]
beliefer commented on PR #50047: URL: https://github.com/apache/spark/pull/50047#issuecomment-2676747067 Merged into branch-4.0/master @LuciferYang Thank you!
[PR] [SPARK-51298][WIP] Support variant in CSV scan [spark]
chenhao-db opened a new pull request, #50052:
URL: https://github.com/apache/spark/pull/50052

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

### Was this patch authored or co-authored using generative AI tooling?
No.
[PR] [DRAFT] Resolve default string exprs [spark]
stefankandic opened a new pull request, #50053:
URL: https://github.com/apache/spark/pull/50053

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966931924 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), + Map.empty, + isInternal = true, + context = context +) +(condStmt, body) + } + + override def getTreeIterator: Iterator[CompoundStatementExec] = treeIterator + + override def reset(): Unit = { +state = CaseState.Condition Review Comment: You're right, added it. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/SqlScriptingLogicalPlans.scala: ## @@ -226,7 +226,7 @@ case class IterateStatement(label: String) extends CompoundPlanStatement { * in WHEN branches. * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. */ -case class CaseStatement( +case class SearchedCaseStatement( Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to
[PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon opened a new pull request, #50054:
URL: https://github.com/apache/spark/pull/50054

### What changes were proposed in this pull request?
This PR fixes the installation page for PySpark to clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst.

### Why are the changes needed?
To clarify spark.remote and spark.master in pyspark-connect and pyspark-client.

### Does this PR introduce _any_ user-facing change?
No, doc-only change.

### How was this patch tested?
CI build in this PR.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [DRAFT] Resolve default string exprs [spark]
HyukjinKwon commented on code in PR #50053:
URL: https://github.com/apache/spark/pull/50053#discussion_r1966967731

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala: ##
@@ -28,6 +29,10 @@ import org.apache.spark.sql.types.{DataType, StringType}
 * collation from the corresponding object (table/view -> schema -> catalog).
 */
object ResolveDDLCommandStringTypes extends Rule[LogicalPlan] {
+  // Tag to mark expressions that have been cast to a new type so that we can
+  // avoid infinite recursion when resolving the same expression multiple times.
+  private val CAST_ADDED_TAG = new TreeNodeTag[Unit]("defaultStringExpressionCastAdded")

Review Comment: The tag is gone when the nodes are copied, IIRC, so I don't think this can be used for controlling the behaviour.
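For context on the concern above, here is a minimal, hypothetical sketch (not code from the PR) of how a `TreeNodeTag` is attached to and read from a plan node. The open question in the review is whether such a tag survives when a rule produces a copy of the node.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OneRowRelation}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

object TreeNodeTagSketch {
  // Hypothetical tag, named after the one in the PR diff above.
  private val CAST_ADDED_TAG = new TreeNodeTag[Unit]("defaultStringExpressionCastAdded")

  def main(args: Array[String]): Unit = {
    val plan: LogicalPlan = OneRowRelation()

    // Mark the node and read the mark back on the same instance.
    plan.setTagValue(CAST_ADDED_TAG, ())
    assert(plan.getTagValue(CAST_ADDED_TAG).isDefined)

    // The review concern: if an analyzer rule returns a *copied* node instead of
    // this instance, the copy may not carry the tag, so the tag alone may not be
    // a reliable guard against re-applying the rule.
  }
}
```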
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
HyukjinKwon commented on PR #50050: URL: https://github.com/apache/spark/pull/50050#issuecomment-2677269920 Merged to master and branch-4.0.
Re: [PR] [SPARK-51256][SQL] Increase parallelism if joining with small bucket table [spark]
wangyum commented on PR #50004:
URL: https://github.com/apache/spark/pull/50004#issuecomment-2677280562

```scala
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.range(1000).selectExpr("id", "id + 1 as new_id").write.saveAsTable("t1")
spark.range(10).selectExpr("id").write.bucketBy(1, "id").saveAsTable("t2")
spark.sql("select * from t1 join t2 on t1.id = t2.id").explain("cost")
```

Spark 3.2:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#21L], [id#23L], Inner
   :- Sort [id#21L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#21L, 200), ENSURE_REQUIREMENTS, [plan_id=50]
   :     +- Filter isnotnull(id#21L)
   :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
   +- Sort [id#23L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#23L, 200), ENSURE_REQUIREMENTS, [plan_id=57]
         +- Filter isnotnull(id#23L)
            +- FileScan parquet default.t2[id#23L] Batched: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.2.4-bin-hadoop3.2/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
```

After Spark 3.3:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#21L], [id#23L], Inner
   :- Sort [id#21L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#21L, 1), ENSURE_REQUIREMENTS, [plan_id=51]
   :     +- Filter isnotnull(id#21L)
   :        +- FileScan parquet default.t1[id#21L,new_id#22L] Batched: true, DataFilters: [isnotnull(id#21L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
   +- Sort [id#23L ASC NULLS FIRST], false, 0
      +- Filter isnotnull(id#23L)
         +- FileScan parquet default.t2[id#23L] Batched: true, Bucketed: true, DataFilters: [isnotnull(id#23L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/Downloads/spark-3.3.3-bin-hadoop3/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct, SelectedBucketsCount: 1 out of 1
```
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
HyukjinKwon closed pull request #50050: [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. URL: https://github.com/apache/spark/pull/50050
Re: [PR] [MINOR][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
llphxd commented on PR #50048:
URL: https://github.com/apache/spark/pull/50048#issuecomment-2676709230

> > I would love to, but I failed to register jira and have to wait 24 hours. So I'll create the issue later.
>
> Please tell me if the issue is ready.

The issue is ready: https://issues.apache.org/jira/browse/SPARK-51297
Re: [PR] [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper [spark]
beliefer closed pull request #50046: [SPARK-51258][SQL][FOLLOWUP] Remove unnecessary inheritance from SQLConfHelper URL: https://github.com/apache/spark/pull/50046
Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]
ahshahid commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1966860945 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3185,16 +3217,103 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti "Spark can only do this while using the new shuffle block fetching protocol")) } + test("SPARK-51272: retry all the succeeding stages when the map stage is indeterminate with" + +" concurrent tasks completion") { +if (scheduler != null) { + this.afterEach() +} + +val monitor = new Object() +val resubmitFailedStageReached = Array.fill[Boolean](1)(false) +this.dagSchedulerInterceptor = new DagSchedulerInterceptor { + override def interceptHandleTaskCompletion(event: CompletionEvent): Unit = { +event.reason match { + case Success if event.task.isInstanceOf[ResultTask[_, _]] => +assert(resubmitFailedStageReached(0)) +monitor.synchronized { + monitor.notify() +} + + case _ => +} + } + + override def interceptResubmitFailedStages(): Unit = { +monitor.synchronized { + resubmitFailedStageReached(0) = true + monitor.notify() + monitor.wait() +} + } +} + +this.beforeEach() + +val numPartitions = 2 +val (shuffleId1, shuffleId2) = constructTwoIndeterminateStage() +completeShuffleMapStageSuccessfully(shuffleId2, 0, numPartitions) +val resultStage = scheduler.stageIdToStage(2).asInstanceOf[ResultStage] +val activeJob = resultStage.activeJob +assert(activeJob.isDefined) +// The result stage is still waiting for its 2 tasks to complete +assert(resultStage.findMissingPartitions() == Seq.tabulate(numPartitions)(i => i)) +new Thread(() => { + runEventInCurrentThread( +makeCompletionEvent( + taskSets(2).tasks(0), + FetchFailed(makeBlockManagerId("hostA"), shuffleId1, 0L, 0, 0, "ignored"), + null)) +}).start() + +monitor.synchronized { + if (!resubmitFailedStageReached(0)) { +monitor.wait() + } +} +assert(resubmitFailedStageReached(0)) +new Thread(() => { + runEventInCurrentThread(makeCompletionEvent(taskSets(2).tasks(1), Success, 11)) +}).start() + +val shuffleStage1 = this.scheduler.shuffleIdToMapStage(shuffleId1) +val shuffleStage2 = this.scheduler.shuffleIdToMapStage(shuffleId2) +var keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = shuffleStage1.latestInfo.attemptNumber() != 1 +} +completeShuffleMapStageSuccessfully(0, 1, numPartitions) +keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = shuffleStage2.latestInfo.attemptNumber() != 1 +} + +completeShuffleMapStageSuccessfully(1, 1, numPartitions) +keepGoing = true +while (keepGoing) { + Thread.sleep(500) + keepGoing = resultStage.latestInfo.attemptNumber() != 1 +} + +assert(resultStage.latestInfo.numTasks == 2) + } + test("SPARK-25341: retry all the succeeding stages when the map stage is indeterminate") { val (shuffleId1, shuffleId2) = constructIndeterminateStageFetchFailed() // Check status for all failedStages val failedStages = scheduler.failedStages.toSeq assert(failedStages.map(_.id) == Seq(1, 2)) // Shuffle blocks of "hostC" is lost, so first task of the `shuffleMapRdd2` needs to retry. +// TODO: Asif THIS ASSERTION APPEARS TO BE WRONG. As the ShuffleMapStage is inDeterminate all Review Comment: done. 
## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4163,9 +4282,16 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti val failedStages = scheduler.failedStages.toSeq assert(failedStages.map(_.id) == Seq(1, 2)) // Shuffle blocks of "hostC" is lost, so first task of the `shuffleMapRdd2` needs to retry. +// TODO: Asif THIS ASSERTION APPEARS TO BE WRONG. As the ShuffleMapStage is inDeterminate all Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966944493 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: We have tests for these exceptions, they are in SqlScriptingInterpreterSuite. You're right it will screw up line numbers, perhaps it would be better to copy the origin from the original caseVariable expression. This will keep the proper line number, however would just point to one part of the equality expression. For example, if we have this script: ``` BEGIN CASE 1 WHEN NULL THEN SELECT 41; ELSE SELECT 43; END CASE; END ``` The error we would have if we kept the Origin from case variable expression: `{LINE:3} [BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement 1 is invalid. Expected single row with a value of the BOOLEAN type, but got an empty row. SQLSTATE: 21000` The error with the hacked origin: `[BOOLEAN_STATEMENT_WITH_EMPTY_ROW] Boolean statement (1 = NULL) is invalid.
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1966944963 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: I guess we could expand the hack to copy the just line number from caseVariableExec? That way we would have both the line number and the proper equality expression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [DRAFT] Two string types [spark]
github-actions[bot] commented on PR #48861: URL: https://github.com/apache/spark/pull/48861#issuecomment-2677214294 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-50292] Add MapStatus RowCount optimize skewed job [spark]
github-actions[bot] closed pull request #48825: [SPARK-50292] Add MapStatus RowCount optimize skewed job URL: https://github.com/apache/spark/pull/48825
Re: [PR] [SPARK-50319] Reorder ResolveIdentifierClause and BindParameter rules [spark]
github-actions[bot] commented on PR #48849: URL: https://github.com/apache/spark/pull/48849#issuecomment-2677214302 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon closed pull request #50054: [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst URL: https://github.com/apache/spark/pull/50054
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on PR #50054: URL: https://github.com/apache/spark/pull/50054#issuecomment-2677426348 Merged to master and branch-4.0.
Re: [PR] [SPARK-51299][SQL][UI] MetricUtils.stringValue should filter metric values with initValue rather than a hardcoded value [spark]
jiwen624 commented on PR #50055: URL: https://github.com/apache/spark/pull/50055#issuecomment-2677443309 @dongjoon-hyun @cloud-fan could you let me know your thoughts about this fix? Thanks.
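For readers skimming the digest, the idea in the PR title can be illustrated with a small sketch. This is not the actual `MetricUtils` code; the names and output format below are hypothetical, and it only shows the core change: filter out unset accumulator values by comparing against the metric's declared `initValue` rather than a hardcoded sentinel.

```scala
// Hypothetical sketch of the idea behind SPARK-51299; not the real MetricUtils implementation.
def stringValueSketch(values: Seq[Long], initValue: Long): String = {
  // Drop accumulators that were never updated by comparing against the metric's own
  // initial value instead of assuming the unset value is a fixed sentinel such as -1.
  val validValues = values.filter(_ != initValue)
  if (validValues.isEmpty) "0"
  else {
    val sorted = validValues.sorted
    s"min: ${sorted.head}, med: ${sorted(sorted.size / 2)}, max: ${sorted.last}"
  }
}
```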
Re: [PR] [SPARK-51282][ML][PYTHON][CONNECT] Optimize OneVsRestModel transform by eliminating the JVM-Python data exchange [spark]
zhengruifeng commented on PR #50041: URL: https://github.com/apache/spark/pull/50041#issuecomment-2677458005 Let me convert it to draft for now, to make it easy to debug another issue https://issues.apache.org/jira/browse/SPARK-51118
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
beliefer commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1967061151

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -5545,6 +5545,15 @@
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION =
+    buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption")
+      .internal()
+      .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " +
+        "to the default table location.")

Review Comment: Could we fix the bug? It means users might already use the 'path' option.
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
beliefer commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1967072225

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -5545,6 +5545,15 @@
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION =
+    buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption")
+      .internal()
+      .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " +
+        "to the default table location.")

Review Comment: I got it. The bug has existed for about 3 years, so keeping it behind a legacy flag looks better.
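A usage sketch of the trade-off being discussed, assuming the flag lands as written in the diff above; `df` and the table/path names are hypothetical. Users who depend on the old behavior could opt back into it with the legacy conf.

```scala
// Sketch only: restore the pre-fix behavior via the legacy flag from the diff above.
spark.conf.set("spark.sql.legacy.dataFrameWriterV2IgnorePathOption", "true")
df.writeTo("t").using("json").option("path", "/tmp/t").create()  // 'path' is ignored again
```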
[PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]
itholic opened a new pull request, #50056:
URL: https://github.com/apache/spark/pull/50056

### What changes were proposed in this pull request?
This PR proposes to fix a broken link for `ps.sql`.

### Why are the changes needed?
There is a broken link in the official documentation for `ps.sql`:
https://github.com/user-attachments/assets/de70f1a9-35a2-4adb-80e4-d9726170344e

### Does this PR introduce _any_ user-facing change?
No API changes, but a user-facing documentation improvement.

### How was this patch tested?
Manually tested, and the existing doc build in CI should pass.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51206][PYTHON][CONNECT] Move Arrow conversion helpers out of Spark Connect [spark]
zhengruifeng commented on code in PR #49941: URL: https://github.com/apache/spark/pull/49941#discussion_r1967091437 ## python/pyspark/sql/conversion.py: ## @@ -0,0 +1,543 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +from pyspark.sql.pandas.utils import require_minimum_pyarrow_version + +# Arrow is required for arrow conversion. +require_minimum_pyarrow_version() + + +import array +import datetime +import decimal +from typing import Any, Callable, List, Sequence + +import pyarrow as pa +from pyspark.errors import PySparkValueError +from pyspark.sql.pandas.types import _dedup_names, _deduplicate_field_names, to_arrow_schema +from pyspark.sql.types import ( +ArrayType, +BinaryType, +DataType, +DecimalType, +MapType, +NullType, +Row, +StringType, +StructField, +StructType, +TimestampNTZType, +TimestampType, +UserDefinedType, +VariantType, +VariantVal, +_create_row, +) + + +class LocalDataToArrowConversion: +""" +Conversion from local data (except pandas DataFrame and numpy ndarray) to Arrow. +""" + +@staticmethod +def _need_converter( +dataType: DataType, +nullable: bool = True, +) -> bool: +if not nullable: +# always check the nullability +return True +elif isinstance(dataType, NullType): +# always check the nullability +return True +elif isinstance(dataType, StructType): +# Struct maybe rows, should convert to dict. +return True +elif isinstance(dataType, ArrayType): +return LocalDataToArrowConversion._need_converter( +dataType.elementType, dataType.containsNull +) +elif isinstance(dataType, MapType): +# Different from PySpark, here always needs conversion, +# since an Arrow Map requires a list of tuples. 
+return True +elif isinstance(dataType, BinaryType): +return True +elif isinstance(dataType, (TimestampType, TimestampNTZType)): +# Always truncate +return True +elif isinstance(dataType, DecimalType): +# Convert Decimal('NaN') to None +return True +elif isinstance(dataType, StringType): +# Coercion to StringType is allowed +return True +elif isinstance(dataType, UserDefinedType): +return True +elif isinstance(dataType, VariantType): +return True +else: +return False + +@staticmethod +def _create_converter( +dataType: DataType, +nullable: bool = True, +variants_as_dicts: bool = False, # some code paths may require python internal types +) -> Callable: +assert dataType is not None and isinstance(dataType, DataType) +assert isinstance(nullable, bool) + +if not LocalDataToArrowConversion._need_converter(dataType, nullable): +return lambda value: value + +if isinstance(dataType, NullType): + +def convert_null(value: Any) -> Any: +if value is not None: +raise PySparkValueError(f"input for {dataType} must be None, but got {value}") +return None + +return convert_null + +elif isinstance(dataType, StructType): +field_names = dataType.fieldNames() +dedup_field_names = _dedup_names(dataType.names) + +field_convs = [ +LocalDataToArrowConversion._create_converter( +field.dataType, field.nullable, variants_as_dicts +) +for field in dataType.fields +] + +def convert_struct(value: Any) -> Any: +if value is None: +if not nullable: +raise PySparkValueError(f"input for {dataType} must not be None") +return None +else: +assert isinstance(value, (tuple, dict)) or hasattr( +value, "__dict__" +), f"{type(value)} {value}" + +_dict = {} +
[PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 opened a new pull request, #50057:
URL: https://github.com/apache/spark/pull/50057

### What changes were proposed in this pull request?
Bump zstd-jni to the latest version.

### Why are the changes needed?
https://github.com/facebook/zstd/releases/tag/v1.5.7
https://github.com/luben/zstd-jni/releases/tag/v1.5.7-1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GHA. Will update benchmark soon.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 commented on PR #50057: URL: https://github.com/apache/spark/pull/50057#issuecomment-2677646689 cc @dongjoon-hyun @LuciferYang
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
davidm-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1967153704 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -599,6 +599,116 @@ class CaseStatementExec( } } +/** + * Executable node for SimpleCaseStatement. + * @param caseVariableExec Statement with which all conditionExpressions will be compared to. + * @param conditionExpressions Collection of expressions which correspond to WHEN clauses. + * @param conditionalBodies Collection of executable bodies that have a corresponding condition, + * in WHEN branches. + * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch. + * @param session Spark session that SQL script is executed within. + * @param context SqlScriptingExecutionContext keeps the execution state of current script. + */ +class SimpleCaseStatementExec( +caseVariableExec: SingleStatementExec, +conditionExpressions: Seq[Expression], +conditionalBodies: Seq[CompoundBodyExec], +elseBody: Option[CompoundBodyExec], +session: SparkSession, +context: SqlScriptingExecutionContext) extends NonLeafStatementExec { + private object CaseState extends Enumeration { +val Condition, Body = Value + } + + private var state = CaseState.Condition + var bodyExec: Option[CompoundBodyExec] = None + + var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _ + private var caseVariableLiteral: Literal = _ + + private var isCacheValid = false + private def validateCache(): Unit = { +if (!isCacheValid) { + val values = caseVariableExec.buildDataFrame(session).collect() + caseVariableExec.isExecuted = true + + caseVariableLiteral = Literal(values.head.get(0)) + conditionBodyTupleIterator = createConditionBodyIterator + isCacheValid = true +} + } + + private def cachedCaseVariableLiteral: Literal = { +validateCache() +caseVariableLiteral + } + + private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = { +validateCache() +conditionBodyTupleIterator + } + + private lazy val treeIterator: Iterator[CompoundStatementExec] = +new Iterator[CompoundStatementExec] { + override def hasNext: Boolean = state match { +case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined +case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext) + } + + override def next(): CompoundStatementExec = state match { +case CaseState.Condition => + cachedConditionBodyIterator.nextOption() +.map { case (condStmt, body) => + if (evaluateBooleanCondition(session, condStmt)) { +bodyExec = Some(body) +state = CaseState.Body + } + condStmt +} +.orElse(elseBody.map { body => { + bodyExec = Some(body) + state = CaseState.Body + next() +}}) +.get +case CaseState.Body => bodyExec.get.getTreeIterator.next() + } +} + + private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = +conditionExpressions.zip(conditionalBodies) + .iterator + .map { case (expr, body) => +val condition = Project( + Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()), + OneRowRelation() +) +// We hack the Origin to provide more descriptive error messages. For example, if +// the case variable is 1 and the condition expression it's compared to is 5, we +// will get Origin with text "(1 = 5)". 
+val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString +val condStmt = new SingleStatementExec( + condition, + Origin(sqlText = Some(conditionText), +startIndex = Some(0), +stopIndex = Some(conditionText.length - 1)), Review Comment: If it's an easy fix (and it looks like it is) I would definitely go with this approach (copying the line number to the hacked origin). If not, I think it's better to keep the line number and then improve the messaging later on as a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
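A rough sketch of the change being agreed on in this thread, adapted from the snippet quoted above: keep the synthesized "(caseVar = expr)" text but carry over the line number from the case variable's origin. It assumes `Origin` exposes a `line` field and that `SingleStatementExec` keeps its `origin` accessible under these names; treat it as a sketch of the suggestion, not the final patch.

```scala
// Sketch (field/accessor names are assumptions, see the note above):
val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString
val condStmt = new SingleStatementExec(
  condition,
  Origin(
    line = caseVariableExec.origin.line,  // keep the original line number for error messages
    sqlText = Some(conditionText),
    startIndex = Some(0),
    stopIndex = Some(conditionText.length - 1)),
  Map.empty,
  isInternal = true,
  context = context)
```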
Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]
cloud-fan commented on code in PR #50026:
URL: https://github.com/apache/spark/pull/50026#discussion_r1967012051

## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ##
@@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap(
   def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler
 }

+object TriggerToExceptionHandlerMap {
+  def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap(

Review Comment: Let's be very cautious about any global state. Can we write a detailed comment to explain how this global state is used and cleaned up? Or try our best to not use this global state.
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
wangyum commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2677370449 @shrprasa Does it work after applying the patch?
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
dongjoon-hyun commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1967023059 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala: ## @@ -839,4 +839,26 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo condition = "CALL_ON_STREAMING_DATASET_UNSUPPORTED", parameters = Map("methodName" -> "`writeTo`")) } + + test("SPARK-51281: create/replace file source tables") { +Seq(true, false).foreach { ignorePath => + withSQLConf(SQLConf.LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION.key -> ignorePath.toString) { +withTable("t1", "t2") { + spark.range(10).writeTo("t1").using("json").create() + checkAnswer(spark.table("t1"), spark.range(10).toDF()) Review Comment: nit. `spark.range(10).toDF()` seems to be repeated three times. Shall we define one to reuse during this unit test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
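The refactor suggested above could look roughly like the following; the value name and its placement inside the test body are assumptions, not the final patch.

```scala
test("SPARK-51281: create/replace file source tables") {
  Seq(true, false).foreach { ignorePath =>
    withSQLConf(SQLConf.LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION.key -> ignorePath.toString) {
      withTable("t1", "t2") {
        // Build the expected result once instead of repeating spark.range(10).toDF().
        val expected = spark.range(10).toDF()
        spark.range(10).writeTo("t1").using("json").create()
        checkAnswer(spark.table("t1"), expected)
        // ...the remaining checks in the PR would reuse `expected` the same way.
      }
    }
  }
}
```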
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1966991532 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5545,6 +5545,15 @@ object SQLConf { .booleanConf .createWithDefault(false) + val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION = +buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption") + .internal() + .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " + +"to the default table location.") + .version("4.0.0") Review Comment: ```suggestion .version("3.5.5") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
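Independent of the version string, the new flag is an internal boolean conf, so opting back into the old behavior would presumably look like this (a sketch; the session setup is assumed):

```scala
// Restore the pre-fix behavior where DataFrameWriterV2 ignores the 'path' option
// and always writes to the default table location (the new default is false).
spark.conf.set("spark.sql.legacy.dataFrameWriterV2IgnorePathOption", "true")
```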
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967007721 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala: ## @@ -1407,13 +1409,62 @@ private[hive] object HiveClientImpl extends Logging { case _ => new HiveConf(conf, classOf[HiveConf]) } -try { +val hive = try { Hive.getWithoutRegisterFns(hiveConf) } catch { // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but - // 2.3.8 don't), therefore here we fallback when encountering the exception. + // 2.3.8 doesn't), therefore here we fallback when encountering the exception. case _: NoSuchMethodError => Hive.get(hiveConf) } +configureMaxThriftMessageSize(hiveConf, hive.getMSC) +hive + } + + // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633 + private def configureMaxThriftMessageSize( + hiveConf: HiveConf, msClient: IMetaStoreClient): Unit = try { +msClient match { + // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to support synchronization + // and retrying, we should unwrap and access the real MetaStoreClient instance firstly + case proxy if JdkProxy.isProxyClass(proxy.getClass) => +JdkProxy.getInvocationHandler(proxy) match { + case syncHandler if syncHandler.getClass.getName.endsWith("SynchronizedHandler") => +val realMscField = SparkClassUtils.classForName( Review Comment: @LuciferYang thanks for the review, I simplified the reflection call by following your suggestions. Regarding concerns about the reflection code itself, I tuned the code so that the added logic only takes effect when the user explicitly configures `hive.thrift.client.max.message.size`, in case users compile Spark with their own modified Hive version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
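The unwrapping step being discussed boils down to peeling JDK dynamic-proxy layers before touching the real `MetaStoreClient`. A generic sketch of that pattern follows; the delegate field name and the recursion are assumptions for illustration, not the exact Hive internals.

```scala
import java.lang.reflect.{InvocationHandler, Proxy}

// Generic sketch: unwrap a JDK dynamic proxy to reach the object it delegates to.
// The field name "client" is assumed; Hive's SynchronizedHandler and retrying
// handler keep their delegates in their own private fields.
def unwrapProxy(obj: AnyRef, delegateFieldName: String = "client"): AnyRef =
  if (Proxy.isProxyClass(obj.getClass)) {
    val handler: InvocationHandler = Proxy.getInvocationHandler(obj)
    val field = handler.getClass.getDeclaredField(delegateFieldName) // may throw if the name differs
    field.setAccessible(true)
    unwrapProxy(field.get(handler), delegateFieldName)
  } else {
    obj
  }
```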
Re: [PR] [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
beliefer commented on PR #50048: URL: https://github.com/apache/spark/pull/50048#issuecomment-2677392603 Merged into branch-4.0/master @llphxd @HyukjinKwon @yaooqinn Thank you all! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-51299][SQL][Web UI] Filter out invalid metric values with initValue [spark]
jiwen624 opened a new pull request, #50055: URL: https://github.com/apache/spark/pull/50055 ### What changes were proposed in this pull request? This PR proposes to use the `initValue` of a metric in `org.apache.spark.util.MetricUtils.stringValue` instead of a hardcoded init value to filter out invalid metric values. ### Why are the changes needed? In the method `org.apache.spark.util.MetricUtils.stringValue`, a hardcoded value of 0 is used to filter out invalid metric values for SIZE_METRIC, TIMING_METRIC and NS_TIMING_METRIC: `val validValues = values.filter(_ >= 0)` However, we offer methods to create these types of metrics with an initValue other than -1 (introduced in https://github.com/apache/spark/pull/41555): `def createSizeMetric(sc: SparkContext, name: String, initValue: Long = -1): SQLMetric = {` This means there is a chance that the metrics are created with an initValue != -1, and in this case the filter above will produce incorrect results. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
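The proposed direction, roughly: thread the metric's own `initValue` into the filter instead of hard-coding the -1/0 sentinel. The sketch below assumes the init value can be passed to the helper and that "valid" means strictly greater than the init value; the exact comparison is a guess, not the PR's final choice.

```scala
// Sketch: filter by the metric's declared initValue rather than assuming it is -1.
def validMetricValues(values: Seq[Long], initValue: Long): Seq[Long] =
  values.filter(_ > initValue)
```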
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967004950 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -81,14 +82,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat * Due to classloader isolation issues, pattern matching won't work here so we need * to compare the canonical names of the exceptions, which we assume to be stable. */ - private def isClientException(e: Throwable): Boolean = { -var temp: Class[_] = e.getClass -var found = false -while (temp != null && !found) { - found = clientExceptions.contains(temp.getCanonicalName) - temp = temp.getSuperclass -} -found + @tailrec + private def isClientException(e: Throwable): Boolean = e match { +case re: RuntimeException if re.getCause != null => + isClientException(re.getCause) Review Comment: To cover the `RuntimeException` thrown by `Hive.getMSC` ``` Cause: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1742) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3607) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3659) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3639) at org.apache.spark.sql.hive.client.HiveClientImpl$.getHive(HiveClientImpl.scala:1420) at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:269) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:236) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:235) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:285) at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:420) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:192) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.scala:17) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:192) ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
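Putting the removed and added hunks of the diff together, the updated helper presumably ends up along these lines (a reconstruction from the quoted fragments, not a verified copy of the PR; `clientExceptions` is the existing set in HiveExternalCatalog).

```scala
import scala.annotation.tailrec

@tailrec
private def isClientException(e: Throwable): Boolean = e match {
  // New case: unwrap RuntimeException wrappers such as the one thrown by Hive.getMSC,
  // so the wrapped metastore exception is still classified as a client-side error.
  case re: RuntimeException if re.getCause != null =>
    isClientException(re.getCause)
  case other =>
    // Original logic, kept as the base case: walk up the class hierarchy and compare
    // canonical names against the known client exception classes.
    var temp: Class[_] = other.getClass
    var found = false
    while (temp != null && !found) {
      found = clientExceptions.contains(temp.getCanonicalName)
      temp = temp.getSuperclass
    }
    found
}
```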
Re: [PR] [SPARK-51294][SQL][CONNECT][TESTS] Improve the readability by split the variable of jars and configs for SparkConnectServerUtils. [spark]
beliefer commented on PR #50050: URL: https://github.com/apache/spark/pull/50050#issuecomment-2677381793 @HyukjinKwon Thank you very much! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md [spark]
beliefer closed pull request #50048: [SPARK-51297][DOCS] Fixed the scope of the query option in sql-data-sources-jdbc.md URL: https://github.com/apache/spark/pull/50048 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]
itholic commented on code in PR #50038: URL: https://github.com/apache/spark/pull/50038#discussion_r1967030513 ## python/pyspark/logger/logger.py: ## @@ -66,10 +67,21 @@ def format(self, record: logging.LogRecord) -> str: } if record.exc_info: exc_type, exc_value, exc_tb = record.exc_info +stacktrace = traceback.extract_tb(exc_tb) + +structured_stacktrace = [ +{ +"class": exc_type.__name__ if exc_type else "UnknownException", Review Comment: Makes sense. Just applied the suggestion. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967033629 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: let me fix a little bit more -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
HyukjinKwon commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967033498 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Actually I also meant local-cluster too :-).. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
dongjoon-hyun commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967034041 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][DOCS] Clarify spark.remote and spark.master in pyspark-connect and pyspark-client at install.rst [spark]
dongjoon-hyun commented on code in PR #50054: URL: https://github.com/apache/spark/pull/50054#discussion_r1967030929 ## python/docs/source/getting_started/install.rst: ## @@ -96,9 +96,7 @@ If you want to make Spark Connect default, you can install and additional librar It will automatically install ``pyspark`` library as well as dependencies that are necessary for Spark Connect. If you want to customize ``pyspark``, you need to install ``pyspark`` with the instructions above in advance. -Note that ``pyspark`` command will use ``--master`` option for Spark Connect remote URL instead of ``--remote`` option. - -See also `Quickstart: Spark Connect `_ for how to use it. +This package supports both ``spark.master`` (``--master``) with a locally running Spark Connect server, and ``spark.remote`` (``--remote``) including ``local*`` as well as connection URIs such as ``sc://localhost``. See also `Quickstart: Spark Connect `_ for how to use it. Review Comment: Does `local*` mean the `local[K,F]` syntax in this line, @HyukjinKwon? If so, it looks a little confusing to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org