Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
beliefer commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973314920

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -122,23 +124,40 @@ private[spark] class BarrierCoordinator(
     // Init a TimerTask for a barrier() call.
     private def initTimerTask(state: ContextBarrierState): Unit = {
       timerTask = new TimerTask {
-        override def run(): Unit = state.synchronized {
-          // Timeout current barrier() call, fail all the sync requests.
-          requesters.foreach(_.sendFailure(new SparkException("The coordinator didn't get all " +
-            s"barrier sync requests for barrier epoch $barrierEpoch from $barrierId within " +
-            s"$timeoutInSecs second(s).")))
-          cleanupBarrierStage(barrierId)
-        }
+        override def run(): Unit =
+          try {
+            state.synchronized {
+              if (!Thread.currentThread().isInterrupted()) {
+                // Timeout current barrier() call, fail all the sync requests.
+                requesters.foreach(
+                  _.sendFailure(new SparkException("The coordinator didn't get all " +
+                    s"barrier sync requests for barrier epoch" +
+                    s" $barrierEpoch from $barrierId within " +
+                    s"$timeoutInSecs second(s).")))
+                cleanupBarrierStage(barrierId)
+              }
+            }
+          } catch {
+            case _: InterruptedException =>
+              // Handle interruption gracefully
+              Thread.currentThread().interrupt()
+            case e: Exception => new SparkException("Error during " +
+              s"running of barrier tasks for " +
+              s"$barrierId", e)
+          } finally {
+            // Ensure cleanup happens even if interrupted or exception occurs
+            cleanupBarrierStage(barrierId)
+            state.clear()
+          }
       }
     }

-    // Cancel the current active TimerTask and release resources.
+    /* Cancel the tasks scheduled to run inside the ScheduledExecutor Threadpool
+     * The original implementation was clearing java.util.Timer and java.util.TimerTasks
+     * This became a no-op when java.util.Timer was replaced with ScheduledThreadPoolExecutor
+     */
     private def cancelTimerTask(): Unit = {
-      if (timerTask != null) {
-        timerTask.cancel()

Review Comment:
timerTask.cancel() doesn't work?
[PR] [SPARK-51337][SQL] Add maxRows to CTERelationRef [spark]
vladimirg-db opened a new pull request, #50104:
URL: https://github.com/apache/spark/pull/50104

### What changes were proposed in this pull request?
Add a `maxRows` field to `CTERelationRef`.

### Why are the changes needed?
The Analyzer validates scalar subqueries by checking whether they output just one row: https://github.com/vladimirg-db/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ValidateSubqueryExpression.scala#L144. This works in the fixed-point Analyzer, because CTEs are inlined before `CheckAnalysis`. However, in the single-pass Analyzer, we validate the subquery right after it is resolved, so `CTERelationRef` must report the correct `maxRows` as well.

### Does this PR introduce _any_ user-facing change?
There should be no changes to existing Catalyst behavior.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.
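For illustration only (this query is not from the PR; it only shows the kind of plan the check cares about): the scalar subquery below is provably single-row only if the row-count bound from the CTE body survives through the `CTERelationRef`, which is what propagating `maxRows` enables before inlining.

```scala
// Scalar subquery whose single-row guarantee comes from a LIMIT inside the CTE.
// With fixed-point analysis, the CTE is inlined before CheckAnalysis, so the
// Limit node supplies maxRows = 1; before inlining, the CTERelationRef itself
// has to report that bound for the same validation to succeed.
val df = spark.sql(
  """WITH t AS (SELECT id FROM range(10) LIMIT 1)
    |SELECT (SELECT id FROM t) AS single_value
    |""".stripMargin)
df.show()
```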
Re: [PR] [SPARK-51332][SQL] DS V2 supports push down BIT_AND, BIT_OR, BIT_XOR, BIT_COUNT and BIT_GET [spark]
beliefer commented on PR #50097: URL: https://github.com/apache/spark/pull/50097#issuecomment-2687554240 ping @cloud-fan
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
beliefer commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973321076

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -80,8 +81,9 @@ private[spark] class BarrierCoordinator(
       states.forEachValue(1, clearStateConsumer)
       states.clear()
       listenerBus.removeListener(listener)
-      ThreadUtils.shutdown(timer)
     } finally {
+      timerFuture.foreach(_.cancel(true))
+      ThreadUtils.shutdown(timer)

Review Comment:
I checked other code; `ThreadUtils.shutdown(...)` is not placed into a finally block. Could you give me some explanation? Thank you.
[PR] [WIP][SQL] Add `TimeType` [spark]
MaxGekk opened a new pull request, #50103:
URL: https://github.com/apache/spark/pull/50103

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DataTypeSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-50849][Connect] Add example project to demonstrate Spark Connect Server Libraries [spark]
vicennial commented on PR #49604:
URL: https://github.com/apache/spark/pull/49604#issuecomment-2687751626

Thanks for the comment @LuciferYang

> I have some doubts about merging this part of the code into the main codebase, as it seems more like an independent project to me.

It is independent, but this is due to the independence property of the Spark Connect Client (which is, by design, a thin client with lightweight dependencies). Since this serves to demonstrate extending Spark Connect and using the extension via a Spark Connect Client, I chose to mimic how a practical example would look, and thus there is no direct connection to the parent Spark tree.

> Additionally, I haven't found any process for building and testing it in any of Spark CI (if there is one, please point it out to me). What is the recognized way to ensure its integrity and health after we make changes to it?

You are right, there is no automated process in the Spark CI at the moment. In an earlier [comment](https://issues.apache.org/jira/browse/SPARK-51338), I requested to separate out the addition of the automated build step so that the somewhat intricate complexity of modifying the CI can be tackled in a fresh PR. I've opened up this ticket to track the follow-up: https://issues.apache.org/jira/browse/SPARK-51338. I am actively looking into the process to accomplish this step, albeit a bit slowly, as I haven't modified the CI in the past.

> Additionally, in what form will this be released to users?

This largely serves as a reference skeleton project structure (accompanied by the documentation in https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc1-docs/_site/app-dev-spark-connect.html). There is no intention at the moment to have users depend on this project; rather, they can use it as a base for their own separate/independent development.
[PR] [SPARK-51340][ML][CONNECT] Model size estimation for linear classification & regression models [spark]
zhengruifeng opened a new pull request, #50106:
URL: https://github.com/apache/spark/pull/50106

### What changes were proposed in this pull request?
Model size estimation for linear classification & regression models

### Why are the changes needed?
Pre-training model size estimation is required to control the model cache at the driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added tests

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-51325] Check in source code for `smallJar.jar` [spark]
vicennial closed pull request #50092: [SPARK-51325] Check in source code for `smallJar.jar` URL: https://github.com/apache/spark/pull/50092
Re: [PR] [SPARK-51325] Check in source code for `smallJar.jar` [spark]
vicennial commented on PR #50092: URL: https://github.com/apache/spark/pull/50092#issuecomment-2687688931 Understood. Closing this PR and holding the code changes until we have a direction based on the dev list discussion
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
jayadeep-jayaraman commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973338339

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -122,23 +124,40 @@ private[spark] class BarrierCoordinator(
     // Init a TimerTask for a barrier() call.
     private def initTimerTask(state: ContextBarrierState): Unit = {
       timerTask = new TimerTask {
-        override def run(): Unit = state.synchronized {
-          // Timeout current barrier() call, fail all the sync requests.
-          requesters.foreach(_.sendFailure(new SparkException("The coordinator didn't get all " +
-            s"barrier sync requests for barrier epoch $barrierEpoch from $barrierId within " +
-            s"$timeoutInSecs second(s).")))
-          cleanupBarrierStage(barrierId)
-        }
+        override def run(): Unit =
+          try {
+            state.synchronized {
+              if (!Thread.currentThread().isInterrupted()) {
+                // Timeout current barrier() call, fail all the sync requests.
+                requesters.foreach(
+                  _.sendFailure(new SparkException("The coordinator didn't get all " +
+                    s"barrier sync requests for barrier epoch" +
+                    s" $barrierEpoch from $barrierId within " +
+                    s"$timeoutInSecs second(s).")))
+                cleanupBarrierStage(barrierId)
+              }
+            }
+          } catch {
+            case _: InterruptedException =>
+              // Handle interruption gracefully
+              Thread.currentThread().interrupt()
+            case e: Exception => new SparkException("Error during " +
+              s"running of barrier tasks for " +
+              s"$barrierId", e)
+          } finally {
+            // Ensure cleanup happens even if interrupted or exception occurs
+            cleanupBarrierStage(barrierId)
+            state.clear()
+          }
       }
     }

-    // Cancel the current active TimerTask and release resources.
+    /* Cancel the tasks scheduled to run inside the ScheduledExecutor Threadpool
+     * The original implementation was clearing java.util.Timer and java.util.TimerTasks
+     * This became a no-op when java.util.Timer was replaced with ScheduledThreadPoolExecutor
+     */
     private def cancelTimerTask(): Unit = {
-      if (timerTask != null) {
-        timerTask.cancel()

Review Comment:
It does not work. The explanation is available [here](https://github.com/apache/spark/pull/47956#issuecomment-2328035086). Basically, the issue is that we moved from the `TimerTask` API to `ScheduledThreadPoolExecutor`, so the cancel method is no longer directly available: `ScheduledThreadPoolExecutor` returns a `ScheduledFuture`, and we need to use the [cancel](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/Future.html) method of `Future`.
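A minimal sketch of the difference being described (the scheduling parameters and task body are illustrative, not the actual BarrierCoordinator code):

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

object CancelSketch {
  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newScheduledThreadPool(1)

    // With ScheduledThreadPoolExecutor, a scheduled Runnable (even one that is a
    // java.util.TimerTask) is tracked through the ScheduledFuture returned by
    // schedule(); the executor never consults TimerTask's own cancellation flag,
    // so calling TimerTask.cancel() does not stop the scheduled run.
    val task: Runnable = () => println("timeout fired")
    val future: ScheduledFuture[_] = scheduler.schedule(task, 10, TimeUnit.SECONDS)

    // Cancellation has to go through the Future instead; passing `true` also
    // interrupts the task if it is already running.
    future.cancel(true)

    scheduler.shutdown()
  }
}
```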
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
jayadeep-jayaraman commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973345489

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -80,8 +81,9 @@ private[spark] class BarrierCoordinator(
       states.forEachValue(1, clearStateConsumer)
       states.clear()
       listenerBus.removeListener(listener)
-      ThreadUtils.shutdown(timer)
     } finally {
+      timerFuture.foreach(_.cancel(true))
+      ThreadUtils.shutdown(timer)

Review Comment:
I moved it to `finally` to ensure that even if there are any failures in the previous cleanup steps, we still shut down the timer. But I can move it to the `try` block if it is a concern.
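A stripped-down sketch of the ordering argument (the step names are placeholders, not the real BarrierCoordinator members):

```scala
// If any earlier cleanup step throws, the finally block still runs, so the timer
// pool is shut down and its non-daemon thread cannot keep the JVM alive. Moving
// the shutdown into the try block would lose that guarantee.
def stopSketch(cleanupSteps: Seq[() => Unit], shutdownTimer: () => Unit): Unit = {
  try {
    cleanupSteps.foreach(_.apply())
  } finally {
    shutdownTimer()
  }
}
```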
[PR] Add java/scala version in response [spark]
garlandz-db opened a new pull request, #50102:
URL: https://github.com/apache/spark/pull/50102

### What changes were proposed in this pull request?
* Piggyback off the Spark version response and include other env properties like the Java/Scala version.

### Why are the changes needed?
* The client is not able to know which Java/Scala version the server uses, which can be problematic with UDFs when the client has a different build environment than the server.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
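One plausible way to collect these properties on the server side (a sketch of the idea only; `ServerEnvInfo` and where such fields would be attached in the Connect response are assumptions, not part of this PR):

```scala
import scala.util.Properties

// Build-environment details a server could report alongside its Spark version,
// so a client can detect UDF-incompatible build environments up front.
case class ServerEnvInfo(sparkVersion: String, javaVersion: String, scalaVersion: String)

def currentEnvInfo(sparkVersion: String): ServerEnvInfo =
  ServerEnvInfo(
    sparkVersion = sparkVersion,
    javaVersion = System.getProperty("java.version"),  // e.g. "17.0.12"
    scalaVersion = Properties.versionNumberString)     // e.g. "2.13.14"
```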
[PR] [SPARK-51339] Remove `IllegalImportsChecker` for `scala.collection.Seq/IndexedSeq` from `scalastyle-config.xml` [spark]
LuciferYang opened a new pull request, #50105:
URL: https://github.com/apache/spark/pull/50105

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1973575824

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala:

@@ -841,20 +841,24 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo
   }

   test("SPARK-51281: create/replace file source tables") {
+    def checkResults(df: DataFrame): Unit = {
+      checkAnswer(df, spark.range(10).toDF())
+    }

Review Comment:
Let's not over-engineer. `spark.range(10)` is already very short.
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on PR #50040: URL: https://github.com/apache/spark/pull/50040#issuecomment-2687956522 thanks for the review, merging to master/4.0/3.5!
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan closed pull request #50040: [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option URL: https://github.com/apache/spark/pull/50040
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
beliefer commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973575314

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -122,23 +124,40 @@ private[spark] class BarrierCoordinator(
     // Init a TimerTask for a barrier() call.
     private def initTimerTask(state: ContextBarrierState): Unit = {
       timerTask = new TimerTask {
-        override def run(): Unit = state.synchronized {
-          // Timeout current barrier() call, fail all the sync requests.
-          requesters.foreach(_.sendFailure(new SparkException("The coordinator didn't get all " +
-            s"barrier sync requests for barrier epoch $barrierEpoch from $barrierId within " +
-            s"$timeoutInSecs second(s).")))
-          cleanupBarrierStage(barrierId)
-        }
+        override def run(): Unit =
+          try {
+            state.synchronized {
+              if (!Thread.currentThread().isInterrupted()) {
+                // Timeout current barrier() call, fail all the sync requests.
+                requesters.foreach(
+                  _.sendFailure(new SparkException("The coordinator didn't get all " +
+                    s"barrier sync requests for barrier epoch" +
+                    s" $barrierEpoch from $barrierId within " +
+                    s"$timeoutInSecs second(s).")))
+                cleanupBarrierStage(barrierId)
+              }
+            }
+          } catch {
+            case _: InterruptedException =>
+              // Handle interruption gracefully
+              Thread.currentThread().interrupt()
+            case e: Exception => new SparkException("Error during " +
+              s"running of barrier tasks for " +
+              s"$barrierId", e)
+          } finally {
+            // Ensure cleanup happens even if interrupted or exception occurs
+            cleanupBarrierStage(barrierId)
+            state.clear()
+          }
       }
     }

-    // Cancel the current active TimerTask and release resources.
+    /* Cancel the tasks scheduled to run inside the ScheduledExecutor Threadpool
+     * The original implementation was clearing java.util.Timer and java.util.TimerTasks
+     * This became a no-op when java.util.Timer was replaced with ScheduledThreadPoolExecutor
+     */
     private def cancelTimerTask(): Unit = {
-      if (timerTask != null) {
-        timerTask.cancel()

Review Comment:
Got it. Thank you!
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
beliefer commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1973607118

## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:

@@ -122,23 +124,40 @@ private[spark] class BarrierCoordinator(
     // Init a TimerTask for a barrier() call.
     private def initTimerTask(state: ContextBarrierState): Unit = {
       timerTask = new TimerTask {
-        override def run(): Unit = state.synchronized {
-          // Timeout current barrier() call, fail all the sync requests.
-          requesters.foreach(_.sendFailure(new SparkException("The coordinator didn't get all " +
-            s"barrier sync requests for barrier epoch $barrierEpoch from $barrierId within " +
-            s"$timeoutInSecs second(s).")))
-          cleanupBarrierStage(barrierId)
-        }
+        override def run(): Unit =
+          try {
+            state.synchronized {
+              if (!Thread.currentThread().isInterrupted()) {
+                // Timeout current barrier() call, fail all the sync requests.
+                requesters.foreach(
+                  _.sendFailure(new SparkException("The coordinator didn't get all " +
+                    s"barrier sync requests for barrier epoch" +
+                    s" $barrierEpoch from $barrierId within " +
+                    s"$timeoutInSecs second(s).")))
+                cleanupBarrierStage(barrierId)
+              }
+            }
+          } catch {
+            case _: InterruptedException =>
+              // Handle interruption gracefully
+              Thread.currentThread().interrupt()
+            case e: Exception => new SparkException("Error during " +
+              s"running of barrier tasks for " +
+              s"$barrierId", e)
+          } finally {
+            // Ensure cleanup happens even if interrupted or exception occurs
+            cleanupBarrierStage(barrierId)
+            state.clear()
+          }

Review Comment:
Why do we add this block?
[PR] [SPARK-51341][SQL] Cancel time task with suitable way. [spark]
beliefer opened a new pull request, #50107:
URL: https://github.com/apache/spark/pull/50107

### What changes were proposed in this pull request?
This PR proposes to cancel the timer task in a suitable way.

### Why are the changes needed?
According to the discussion at https://github.com/apache/spark/pull/47956#issuecomment-2328035086, TimerTask.cancel() doesn't work.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.
Re: [PR] [SPARK-49756][SQL][FOLLOWUP] Use correct pgsql datetime fields when pushing down EXTRACT [spark]
cloud-fan commented on PR #50101: URL: https://github.com/apache/spark/pull/50101#issuecomment-2688008852 and also cc @yaooqinn
Re: [PR] [SPARK-51182][SQL] DataFrameWriter should throw dataPathNotSpecifiedError when path is not specified [spark]
cloud-fan commented on PR #49928: URL: https://github.com/apache/spark/pull/49928#issuecomment-2688013194 The change LGTM, can we add a test in `DataFrameReaderWriterSuite`?
Re: [PR] [SPARK-48375][SQL] Add support for SIGNAL statement [spark]
cloud-fan commented on PR #49726: URL: https://github.com/apache/spark/pull/49726#issuecomment-2688007524 Can we implement the SIGNAL statement as a `SELECT raise_error(...)`? Then it's consistent that every scripting statement generates a DataFrame.
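For reference, a sketch of the suggested lowering using the existing `raise_error` SQL function (how SIGNAL's condition name and message text would be mapped onto the argument is left open here; the message below is made up):

```scala
// raise_error produces a DataFrame whose evaluation throws, so a SIGNAL
// statement lowered to this shape would still "generate a DataFrame" like
// every other scripting statement.
spark.sql("SELECT raise_error('SIGNAL: custom condition raised') AS signal").collect()
// throws a runtime exception carrying the message above when evaluated
```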
Re: [PR] [SPARK-51299][SQL][UI] MetricUtils.stringValue should filter metric values with initValue rather than a hardcoded value [spark]
cloud-fan commented on PR #50055: URL: https://github.com/apache/spark/pull/50055#issuecomment-2688019216 @jiwen624 I think you are right, do you have a concrete metric that has the problem? For metrics that we only need a sum, `0` is OK as the initial value and we don't filter it out.
Re: [PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
cloud-fan commented on code in PR #50074:
URL: https://github.com/apache/spark/pull/50074#discussion_r1973628519

## sql/core/src/test/resources/sql-tests/results/describe.sql.out:

@@ -890,6 +890,48 @@ a string CONCAT('a\n b\n ', 'c\n d')
 b    int    42

+-- !query
+CREATE TABLE f PARTITIONED BY (B, C) AS SELECT 'APACHE' A, CAST('SPARK' AS BINARY) B, TIMESTAMP'2018-11-17 13:33:33' C
+-- !query schema
+struct<>
+-- !query output
+
+
+-- !query
+DESC FORMATTED f PARTITION (B='SPARK', C=TIMESTAMP'2018-11-17 13:33:33')
+-- !query schema
+struct
+-- !query output
+A    string
+B    binary
+C    timestamp
+# Partition Information
+# col_name    data_type    comment
+B    binary
+C    timestamp
+
+# Detailed Partition Information
+Database    default
+Table    f
+Partition Values    [B=SPARK, C=2018-11-17 13:33:33]
+Location [not included in comparison]/{warehouse_dir}/f/B=SPARK/C=2018-11-17 13%3A33%3A33
+Partition Parameters    [numFiles=1, totalSize=486, transient_lastDdlTime=[not included in comparison]]

Review Comment:
we have `transient_lastDdlTime` because it's a hive table?
Re: [PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
cloud-fan commented on code in PR #50074:
URL: https://github.com/apache/spark/pull/50074#discussion_r1973633764

## sql/core/src/test/resources/sql-tests/results/describe.sql.out:

@@ -890,6 +890,48 @@ a string CONCAT('a\n b\n ', 'c\n d')
 b    int    42

+-- !query
+CREATE TABLE f PARTITIONED BY (B, C) AS SELECT 'APACHE' A, CAST('SPARK' AS BINARY) B, TIMESTAMP'2018-11-17 13:33:33' C
+-- !query schema
+struct<>
+-- !query output
+
+
+-- !query
+DESC FORMATTED f PARTITION (B='SPARK', C=TIMESTAMP'2018-11-17 13:33:33')
+-- !query schema
+struct
+-- !query output
+A    string
+B    binary
+C    timestamp
+# Partition Information
+# col_name    data_type    comment
+B    binary
+C    timestamp
+
+# Detailed Partition Information
+Database    default
+Table    f
+Partition Values    [B=SPARK, C=2018-11-17 13:33:33]
+Location [not included in comparison]/{warehouse_dir}/f/B=SPARK/C=2018-11-17 13%3A33%3A33
+Partition Parameters    [numFiles=1, totalSize=486, transient_lastDdlTime=[not included in comparison]]

Review Comment:
Maybe we can test with `... AS JSON` to avoid seeing these unrelated fields.
Re: [PR] [SPARK-51342][SQL] Add `TimeType` [spark]
the-sakthi commented on code in PR #50103:
URL: https://github.com/apache/spark/pull/50103#discussion_r1974366151

## common/utils/src/main/resources/error/error-conditions.json:

@@ -6190,6 +6190,12 @@
     ],
     "sqlState" : "42000"
   },
+  "UNSUPPORTED_TIME_PRECISION" : {

Review Comment:
Nit: UNSUPPORTED_TIME_SECONDS_PRECISION? Though the precision part comes only from the seconds in the time, so I guess it's fine either way?
Re: [PR] [SPARK-51326][CONNECT][4.0] Remove LazyExpression proto message [spark]
ueshin commented on PR #50094: URL: https://github.com/apache/spark/pull/50094#issuecomment-263482 Thanks! merging to branch-4.0.
Re: [PR] [SPARK-51342][SQL] Add `TimeType` [spark]
the-sakthi commented on PR #50103: URL: https://github.com/apache/spark/pull/50103#issuecomment-2689165941 LGTM
Re: [PR] [SPARK-51339][BUILD] Remove `IllegalImportsChecker` for `s.c.Seq/IndexedSeq` from `scalastyle-config.xml` [spark]
the-sakthi commented on PR #50105: URL: https://github.com/apache/spark/pull/50105#issuecomment-2689144092 LGTM
Re: [PR] [SPARK-51344] Fix `ENV` key value format in `*.template` [spark-docker]
dongjoon-hyun commented on PR #82: URL: https://github.com/apache/spark-docker/pull/82#issuecomment-2689182775 Thank you, @viirya !
Re: [PR] [SPARK-51344] Fix `ENV` key value format in `*.template` [spark-docker]
dongjoon-hyun closed pull request #82: [SPARK-51344] Fix `ENV` key value format in `*.template` URL: https://github.com/apache/spark-docker/pull/82
Re: [PR] [SPARK-51270][SQL] Support UUID type in Variant [spark]
gene-db commented on code in PR #50025:
URL: https://github.com/apache/spark/pull/50025#discussion_r1974322433

## common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:

@@ -240,6 +242,18 @@ public void appendBinary(byte[] binary) {
     writePos += binary.length;
   }

+  public void appendUuid(UUID uuid) {

Review Comment:
Is there a way to test this `appendUuid` api?
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
luben commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1974306557

## core/benchmarks/ZStandardBenchmark-jdk21-results.txt:

@@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 ----------------------------------------------------------------------------------------------------------------------------------------
-Compression 1 times at level 1 without buffer pool                  656             668          12         0.0       65591.6       1.0X
-Compression 1 times at level 2 without buffer pool                  709             711           2         0.0       70934.6       0.9X
-Compression 1 times at level 3 without buffer pool                  814             818           5         0.0       81370.9       0.8X
-Compression 1 times at level 1 with buffer pool                     601             603           2         0.0       60100.1       1.1X
-Compression 1 times at level 2 with buffer pool                     634             636           2         0.0       63449.9       1.0X
-Compression 1 times at level 3 with buffer pool                     748             753           5         0.0       74789.7       0.9X
+Compression 1 times at level 1 without buffer pool                  650             666          11         0.0       65044.6       1.0X
+Compression 1 times at level 2 without buffer pool                  701             703           2         0.0       70071.2       0.9X
+Compression 1 times at level 3 without buffer pool                  804             807           2         0.0       80430.6       0.8X
+Compression 1 times at level 1 with buffer pool                     598             599           1         0.0       59810.5       1.1X
+Compression 1 times at level 2 with buffer pool                     631             640          17         0.0       63096.7       1.0X
+Compression 1 times at level 3 with buffer pool                     751             753           4         0.0       75051.1       0.9X

-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 ----------------------------------------------------------------------------------------------------------------------------------------
-Decompression 1 times from level 1 without buffer pool              817             818           1         0.0       81723.2       1.0X
-Decompression 1 times from level 2 without buffer pool              817             818           1         0.0       81729.4       1.0X
-Decompression 1 times from level 3 without buffer pool              817             818           1         0.0       81719.7       1.0X
-Decompression 1 times from level 1 with buffer pool                 749             757          14         0.0       74864.9       1.1X
-Decompression 1 times from level 2 with buffer pool                 748             749           1         0.0       74789.1       1.1X
-Decompression 1 times from level 3 with buffer pool                 748             749           1         0.0       74811.6       1.1X
+Decompression 1 times from level 1 without buffer pool              823             831          14         0.0       82254.2       1.0X
+Decompression 1 times from level 2 without buffer pool              824             827           5         0.0       82359.1       1.0X
+Decompression 1 times from level 3 without buffer pool              819             821           2         0.0       81925.1       1.0X
+Decompression 1 times from level 1 with buffer pool                 753             755           3         0.0       75284.1       1.1X
+Decompression 1 times from level 2 with buffer pool                 753             755           1         0.0       75344.7       1.1X
+Decompression 1 times from level 3 with buffer pool                 753             754           1         0.0       75263.0       1.1X

-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-Parallel Compression w
Re: [PR] [SPARK-51326][CONNECT][4.0] Remove LazyExpression proto message [spark]
ueshin commented on PR #50094: URL: https://github.com/apache/spark/pull/50094#issuecomment-262567 The remaining test failures are not related to this PR.
Re: [PR] [SPARK-51278][FOLLOWUP][DOCS] Update JSON format from documentation [spark]
the-sakthi commented on PR #50100: URL: https://github.com/apache/spark/pull/50100#issuecomment-2689210402 LGTM
Re: [PR] [SPARK-51270][SQL] Support UUID type in Variant [spark]
cashmand commented on code in PR #50025:
URL: https://github.com/apache/spark/pull/50025#discussion_r1974435292

## common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java:

@@ -240,6 +242,18 @@ public void appendBinary(byte[] binary) {
     writePos += binary.length;
   }

+  public void appendUuid(UUID uuid) {

Review Comment:
Good catch, I hadn't noticed that this function wasn't tested. I modified one of the tests I added to use VariantBuilder to construct its UUID, so that it uses this function. This exposed a bug - when I switched to using `ByteBuffer.wrap`, I didn't update the indices that I wrote the UUID to, so it was always writing them at index 0 of the full value array.
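Roughly the shape such a check could take (the builder constructor flag and the read-back step are assumptions based on how the existing variant tests use `VariantBuilder`; the exact assertion depends on the accessors this PR adds):

```scala
import java.util.UUID
import org.apache.spark.types.variant.VariantBuilder

// Exercise the new API from the diff by appending a fixed UUID.
val builder = new VariantBuilder(false)        // assumed allowDuplicateKeys flag
val uuid = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff")
builder.appendUuid(uuid)
val variant = builder.result()                 // assumed result() accessor
// assert: decoding `variant` yields `uuid` back
```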
Re: [PR] [SPARK-50855][SS][CONNECT] Spark Connect Support for TransformWithState In Scala [spark]
jingz-db commented on code in PR #49488:
URL: https://github.com/apache/spark/pull/49488#discussion_r1974452111

## sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/streaming/TransformWithStateConnectSuite.scala:

Review Comment:
> I've realized that you just ported the tests from python to Scala

Exactly. I am porting the Python tests as connect Scala tests because the Python tests are essentially end-to-end tests as well.

> without leveraging better test framework. Would we just make this test suite be concise as much as possible rather than blind port?

Unfortunately `StreamTest` and `MemoryStream` are not visible from the connect/client/jvm directory we are currently using for connect-related suites. You may find all our other streaming tests are also end-to-end tests and did not make use of the StreamTest framework as well: https://github.com/apache/spark/blob/master/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/streaming/FlatMapGroupsWithStateStreamingSuite.scala#L36
Re: [PR] [SPARK-50855][SS][CONNECT] Spark Connect Support for TransformWithState In Scala [spark]
jingz-db commented on code in PR #49488:
URL: https://github.com/apache/spark/pull/49488#discussion_r1974458075

## sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/streaming/TransformWithStateConnectSuite.scala:

@@ -0,0 +1,489 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.streaming
+
+import java.io.{BufferedWriter, FileWriter}
+import java.nio.file.{Files, Paths}
+import java.sql.Timestamp
+
+import org.scalatest.concurrent.Eventually.eventually
+import org.scalatest.concurrent.Futures.timeout
+import org.scalatest.time.SpanSugar._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, Dataset, Encoders, Row}
+import org.apache.spark.sql.connect.SparkSession
+import org.apache.spark.sql.connect.test.{QueryTest, RemoteSparkSession}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.streaming.{ListState, MapState, OutputMode, StatefulProcessor, StatefulProcessorWithInitialState, TimeMode, TimerValues, TTLConfig, ValueState}
+import org.apache.spark.sql.types._
+
+case class InputRowForConnectTest(key: String, value: String)
+case class OutputRowForConnectTest(key: String, value: String)
+case class StateRowForConnectTest(count: Long)
+
+// A basic stateful processor which will return the occurrences of key
+class BasicCountStatefulProcessor
+    extends StatefulProcessor[String, InputRowForConnectTest, OutputRowForConnectTest]
+    with Logging {
+  @transient protected var _countState: ValueState[StateRowForConnectTest] = _
+
+  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
+    _countState = getHandle.getValueState[StateRowForConnectTest](
+      "countState",
+      Encoders.product[StateRowForConnectTest],
+      TTLConfig.NONE)
+  }
+
+  override def handleInputRows(
+      key: String,
+      inputRows: Iterator[InputRowForConnectTest],
+      timerValues: TimerValues): Iterator[OutputRowForConnectTest] = {
+    val count =
+      _countState.getOption().getOrElse(StateRowForConnectTest(0L)).count + inputRows.toSeq.length
+    _countState.update(StateRowForConnectTest(count))
+    Iterator(OutputRowForConnectTest(key, count.toString))
+  }
+}
+
+// A stateful processor with initial state which will return the occurrences of key
+class TestInitialStatefulProcessor
+    extends StatefulProcessorWithInitialState[
+      String,
+      (String, String),
+      (String, String),
+      (String, String, String)]
+    with Logging {
+  @transient protected var _countState: ValueState[Long] = _
+
+  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
+    _countState = getHandle.getValueState[Long]("countState", Encoders.scalaLong, TTLConfig.NONE)
+  }
+
+  override def handleInputRows(
+      key: String,
+      inputRows: Iterator[(String, String)],
+      timerValues: TimerValues): Iterator[(String, String)] = {
+    val count = _countState.getOption().getOrElse(0L) + inputRows.toSeq.length
+    _countState.update(count)
+    Iterator((key, count.toString))
+  }
+
+  override def handleInitialState(
+      key: String,
+      initialState: (String, String, String),
+      timerValues: TimerValues): Unit = {
+    val count = _countState.getOption().getOrElse(0L) + 1
+    _countState.update(count)
+  }
+}
+
+case class OutputEventTimeRow(key: String, outputTimestamp: Timestamp)
+
+// A stateful processor which will return timestamp of the first item from input rows
+class ChainingOfOpsStatefulProcessor
+    extends StatefulProcessor[String, (String, Timestamp), OutputEventTimeRow] {
+  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {}
+
+  override def handleInputRows(
+      key: String,
+      inputRows: Iterator[(String, Timestamp)],
+      timerValues: TimerValues): Iterator[OutputEventTimeRow] = {
+    val timestamp = inputRows.next()._2
+    Iterator(OutputEventTimeRow(key, timestamp))
+  }
+}
+
+// A basic stateful processor contains composite state variables and TTL
+class TTLTestStatefulProcessor
+    extends StatefulProcessor[String, (String, String), (String, String)] {
+  import java.time.Duration
+
+  @transient protected var countSt
Re: [PR] [SPARK-50855][SS][CONNECT] Spark Connect Support for TransformWithState In Scala [spark]
jingz-db commented on code in PR #49488: URL: https://github.com/apache/spark/pull/49488#discussion_r1974452111 ## sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/streaming/TransformWithStateConnectSuite.scala: ## Review Comment: > I've realized that you just ported the tests from python to Scala Exactly. I am porting the Python tests as connect Scala tests because the Python tests are essentially end-to-end tests as well. > without leveraging a better test framework. Would we just make this test suite as concise as possible rather than a blind port? Unfortunately `StreamTest` and `MemoryStream` are not visible from the connect/client/jvm directory we are currently using for connect-related suites. You may find our other streaming tests are also end-to-end tests and do not make use of the StreamTest framework either: https://github.com/apache/spark/blob/master/sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/streaming/FlatMapGroupsWithStateStreamingSuite.scala#L36 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51337][SQL] Add maxRows to CTERelationDef and CTERelationRef [spark]
cloud-fan closed pull request #50104: [SPARK-51337][SQL] Add maxRows to CTERelationDef and CTERelationRef URL: https://github.com/apache/spark/pull/50104 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51337][SQL] Add maxRows to CTERelationDef and CTERelationRef [spark]
cloud-fan commented on PR #50104: URL: https://github.com/apache/spark/pull/50104#issuecomment-2689469635 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51303] [SQL] [TESTS] Extend `ORDER BY` testing coverage [spark]
cloud-fan commented on PR #50069: URL: https://github.com/apache/spark/pull/50069#issuecomment-2689447332 This is a test-only PR and the other test failures are definitely unrelated. Thanks, merging to master/4.0! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51303] [SQL] [TESTS] Extend `ORDER BY` testing coverage [spark]
cloud-fan closed pull request #50069: [SPARK-51303] [SQL] [TESTS] Extend `ORDER BY` testing coverage URL: https://github.com/apache/spark/pull/50069 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1974559481 ## python/pyspark/sql/tests/test_python_datasource.py: ## @@ -246,6 +248,137 @@ def reader(self, schema) -> "DataSourceReader": assertDataFrameEqual(df, [Row(x=0, y="0"), Row(x=1, y="1")]) self.assertEqual(df.select(spark_partition_id()).distinct().count(), 2) +def test_filter_pushdown(self): +class TestDataSourceReader(DataSourceReader): +def __init__(self): +self.has_filter = False + +def pushdownFilters(self, filters: List[Filter]) -> Iterable[Filter]: +assert set(filters) == { +EqualTo(("x",), 1), +EqualTo(("y",), 2), +}, filters +self.has_filter = True +# pretend we support x = 1 filter but in fact we don't +# so we only return y = 2 filter +yield filters[filters.index(EqualTo(("y",), 2))] + +def partitions(self): +assert self.has_filter +return super().partitions() + +def read(self, partition): +assert self.has_filter +yield [1, 1] +yield [1, 2] +yield [2, 2] + +class TestDataSource(DataSource): +@classmethod +def name(cls): +return "test" + +def schema(self): +return "x int, y int" + +def reader(self, schema) -> "DataSourceReader": +return TestDataSourceReader() + +self.spark.dataSource.register(TestDataSource) +df = self.spark.read.format("test").load().filter("x = 1 and y = 2") Review Comment: added pushed filters to the physical plan description and a scala test for the physical plan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1974559481 ## python/pyspark/sql/tests/test_python_datasource.py: ## @@ -246,6 +248,137 @@ def reader(self, schema) -> "DataSourceReader": assertDataFrameEqual(df, [Row(x=0, y="0"), Row(x=1, y="1")]) self.assertEqual(df.select(spark_partition_id()).distinct().count(), 2) +def test_filter_pushdown(self): +class TestDataSourceReader(DataSourceReader): +def __init__(self): +self.has_filter = False + +def pushdownFilters(self, filters: List[Filter]) -> Iterable[Filter]: +assert set(filters) == { +EqualTo(("x",), 1), +EqualTo(("y",), 2), +}, filters +self.has_filter = True +# pretend we support x = 1 filter but in fact we don't +# so we only return y = 2 filter +yield filters[filters.index(EqualTo(("y",), 2))] + +def partitions(self): +assert self.has_filter +return super().partitions() + +def read(self, partition): +assert self.has_filter +yield [1, 1] +yield [1, 2] +yield [2, 2] + +class TestDataSource(DataSource): +@classmethod +def name(cls): +return "test" + +def schema(self): +return "x int, y int" + +def reader(self, schema) -> "DataSourceReader": +return TestDataSourceReader() + +self.spark.dataSource.register(TestDataSource) +df = self.spark.read.format("test").load().filter("x = 1 and y = 2") Review Comment: added the description and a scala test for this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50855][SS][CONNECT] Spark Connect Support for TransformWithState In Scala [spark]
jingz-db commented on code in PR #49488: URL: https://github.com/apache/spark/pull/49488#discussion_r1974088879 ## sql/connect/common/src/main/protobuf/spark/connect/relations.proto: ## Review Comment: Yes, Scala & Python share the same connect client protocol. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][FOLLOWUP][DOCS] Update JSON format from documentation [spark]
HyukjinKwon closed pull request #50100: [SPARK-51278][FOLLOWUP][DOCS] Update JSON format from documentation URL: https://github.com/apache/spark/pull/50100 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][FOLLOWUP][DOCS] Update JSON format from documentation [spark]
HyukjinKwon commented on PR #50100: URL: https://github.com/apache/spark/pull/50100#issuecomment-2689394019 Merged to master and branch-4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Dev/milast/recurisve cte [spark]
github-actions[bot] closed pull request #48878: Dev/milast/recurisve cte URL: https://github.com/apache/spark/pull/48878 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
dongjoon-hyun commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2689111825 Thank you for the pointer, @pan3793 . Let me try in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51344] Fix `ENV` key value format in `*.template` [spark-docker]
dongjoon-hyun opened a new pull request, #82: URL: https://github.com/apache/spark-docker/pull/82 ### What changes were proposed in this pull request? This PR aims to fix the `ENV` key value format in `*.template`. ### Why are the changes needed? To follow the Docker guideline and fix the following legacy format. - https://docs.docker.com/reference/build-checks/legacy-key-value-format/ ``` - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format ``` ### Does this PR introduce _any_ user-facing change? No. This removes a warning for now. ### How was this patch tested? Manual review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
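For reference, the legacy and fixed forms differ only in the `=` separator between key and value. A minimal illustration of the change described above (the variable name here is illustrative, not necessarily what the templates contain):
```
# Legacy form - triggers the LegacyKeyValueFormat build check
ENV SPARK_HOME /opt/spark

# Fixed form recommended by Docker
ENV SPARK_HOME=/opt/spark
```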
Re: [PR] [SPARK-51298][WIP] Support variant in CSV scan [spark]
the-sakthi commented on PR #50052: URL: https://github.com/apache/spark/pull/50052#issuecomment-2689223159 Hello @chenhao-db, thanks for the changes. Could we please get some description in the JIRA and responses to the PR template questions above? It helps a lot with understanding the context for reviewing the PR. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51345] Remove all EOL versions from `spark-docker` repository [spark-docker]
dongjoon-hyun opened a new pull request, #83: URL: https://github.com/apache/spark-docker/pull/83 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51345] Remove all EOL versions from `spark-docker` repository [spark-docker]
dongjoon-hyun commented on PR #83: URL: https://github.com/apache/spark-docker/pull/83#issuecomment-2689220984 cc @Yikun , @yaooqinn , @LuciferYang , @yaooqinn , @itholic , @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51288][DOCS] Add link for Scala API of Spark Connect [spark]
the-sakthi commented on PR #50042: URL: https://github.com/apache/spark/pull/50042#issuecomment-2689227610 Nice. LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1974189473 ## python/pyspark/sql/tests/test_python_datasource.py: ## @@ -246,6 +248,137 @@ def reader(self, schema) -> "DataSourceReader": assertDataFrameEqual(df, [Row(x=0, y="0"), Row(x=1, y="1")]) self.assertEqual(df.select(spark_partition_id()).distinct().count(), 2) +def test_filter_pushdown(self): +class TestDataSourceReader(DataSourceReader): +def __init__(self): +self.has_filter = False + +def pushdownFilters(self, filters: List[Filter]) -> Iterable[Filter]: +assert set(filters) == { +EqualTo(("x",), 1), +EqualTo(("y",), 2), +}, filters +self.has_filter = True +# pretend we support x = 1 filter but in fact we don't +# so we only return y = 2 filter +yield filters[filters.index(EqualTo(("y",), 2))] + +def partitions(self): +assert self.has_filter +return super().partitions() + +def read(self, partition): +assert self.has_filter +yield [1, 1] +yield [1, 2] +yield [2, 2] + +class TestDataSource(DataSource): +@classmethod +def name(cls): +return "test" + +def schema(self): +return "x int, y int" + +def reader(self, schema) -> "DataSourceReader": +return TestDataSourceReader() + +self.spark.dataSource.register(TestDataSource) +df = self.spark.read.format("test").load().filter("x = 1 and y = 2") Review Comment: like this ``` *(1) Project [x#296, y#297] +- *(1) Filter ((isnotnull(x#296) AND isnotnull(y#297)) AND (y#297 = 2)) +- BatchScan test[x#296, y#297] (Python) PushedFilters: [EqualTo(x,1)], ReadSchema: struct, ShortName: test RuntimeFilters: [] ``` ## python/pyspark/sql/tests/test_python_datasource.py: ## @@ -246,6 +248,137 @@ def reader(self, schema) -> "DataSourceReader": assertDataFrameEqual(df, [Row(x=0, y="0"), Row(x=1, y="1")]) self.assertEqual(df.select(spark_partition_id()).distinct().count(), 2) +def test_filter_pushdown(self): +class TestDataSourceReader(DataSourceReader): +def __init__(self): +self.has_filter = False + +def pushdownFilters(self, filters: List[Filter]) -> Iterable[Filter]: +assert set(filters) == { +EqualTo(("x",), 1), +EqualTo(("y",), 2), +}, filters +self.has_filter = True +# pretend we support x = 1 filter but in fact we don't +# so we only return y = 2 filter +yield filters[filters.index(EqualTo(("y",), 2))] + +def partitions(self): +assert self.has_filter +return super().partitions() + +def read(self, partition): +assert self.has_filter +yield [1, 1] +yield [1, 2] +yield [2, 2] + +class TestDataSource(DataSource): +@classmethod +def name(cls): +return "test" + +def schema(self): +return "x int, y int" + +def reader(self, schema) -> "DataSourceReader": +return TestDataSourceReader() + +self.spark.dataSource.register(TestDataSource) +df = self.spark.read.format("test").load().filter("x = 1 and y = 2") Review Comment: like this ``` *(1) Project [x#296, y#297] +- *(1) Filter ((isnotnull(x#296) AND isnotnull(y#297)) AND (y#297 = 2)) +- BatchScan test[x#296, y#297] (Python) PushedFilters: [EqualTo(x,1)], ReadSchema: struct RuntimeFilters: [] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51345] Remove all EOL versions from `spark-docker` repository [spark-docker]
dongjoon-hyun commented on PR #83: URL: https://github.com/apache/spark-docker/pull/83#issuecomment-2689239268 Thank you, @viirya . No~ AFAIK, the published docker images will exist like the published Maven jars. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51326][CONNECT] Remove LazyExpression proto message [spark]
ueshin closed pull request #50093: [SPARK-51326][CONNECT] Remove LazyExpression proto message URL: https://github.com/apache/spark/pull/50093 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51326][CONNECT] Remove LazyExpression proto message [spark]
ueshin commented on PR #50093: URL: https://github.com/apache/spark/pull/50093#issuecomment-2688892681 I reran the compatibility test after #50094 was merged and it passed. - https://github.com/ueshin/apache-spark/actions/runs/13554631961/job/37946210197 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51347] Enable Ingress and Service Support for Spark Driver [spark-kubernetes-operator]
jiangzho opened a new pull request, #159: URL: https://github.com/apache/spark-kubernetes-operator/pull/159 ### What changes were proposed in this pull request? This PR adds support for launching Ingress and Services with Spark Applications. ### Why are the changes needed? There's a common ask from users to expose the driver via an ingress and service to serve HTTP / HTTPS traffic from outside the cluster - for example, to access the Spark UI. As pod template support is already available, this feature can also be handy for those who would like to expose additional service ports in the same pod. Therefore, this PR aims to introduce this feature in a more general way instead of just exposing the Spark UI. ### Does this PR introduce _any_ user-facing change? No - unreleased version ### How was this patch tested? CIs ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51322][SQL] Better error message for streaming subquery expression [spark]
cloud-fan closed pull request #50088: [SPARK-51322][SQL] Better error message for streaming subquery expression URL: https://github.com/apache/spark/pull/50088 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51322][SQL] Better error message for streaming subquery expression [spark]
cloud-fan commented on PR #50088: URL: https://github.com/apache/spark/pull/50088#issuecomment-2687357950 thanks for the review, merging to master/4.0! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-49756][SQL][FOLLOWUP] Use correct pgsql datetime fields when pushing down EXTRACT [spark]
cloud-fan opened a new pull request, #50101: URL: https://github.com/apache/spark/pull/50101 ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/48210 to fix correctness issues caused by pgsql filter pushdown. These datetime fields were picked wrongly before, see https://neon.tech/postgresql/postgresql-date-functions/postgresql-extract ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, query result is corrected, but this bug is not released yet. ### How was this patch tested? updated test ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
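As an illustration of the kind of field mismatch this follow-up corrects (the date and field below are my own example, not taken from the PR's tests): Spark's `dayofweek`/`DAY_OF_WEEK` numbers days 1 (Sunday) through 7 (Saturday), while Postgres' `EXTRACT(DOW ...)` numbers them 0 (Sunday) through 6 (Saturday), so translating one directly into the other shifts every result. A spark-shell style sketch, assuming the usual `spark` SparkSession:
```
// Spark numbers the week 1 (Sunday) .. 7 (Saturday):
spark.sql("SELECT dayofweek(DATE'2018-11-17')").show()   // 7, a Saturday
// Postgres numbers it 0 (Sunday) .. 6 (Saturday), so the same Saturday is:
//   SELECT EXTRACT(DOW FROM DATE '2018-11-17');   -- 6
// Pushing Spark's DAY_OF_WEEK straight down as DOW therefore changes results.
```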
Re: [PR] [SPARK-49756][SQL][FOLLOWUP] Use correct pgsql datetime fields when pushing down EXTRACT [spark]
cloud-fan commented on PR #50101: URL: https://github.com/apache/spark/pull/50101#issuecomment-2687368062 cc @beliefer @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1973251531 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5545,6 +5545,15 @@ object SQLConf { .booleanConf .createWithDefault(false) + val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION = +buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption") + .internal() + .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " + +"to the default table location.") + .version("3.5.5") Review Comment: ```suggestion .version("3.5.6") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50994][CORE] Perform RDD conversion under tracked execution [spark]
cloud-fan commented on PR #49678: URL: https://github.com/apache/spark/pull/49678#issuecomment-2687493241 thanks, merging to master/4.0! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48375][SQL] Add support for SIGNAL statement [spark]
cloud-fan commented on code in PR #49726: URL: https://github.com/apache/spark/pull/49726#discussion_r1973284907 ## sql/core/src/test/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionSuite.scala: ## @@ -69,6 +70,222 @@ class SqlScriptingExecutionSuite extends QueryTest with SharedSparkSession { } } + // Signal tests + test("signal statement - condition") { +val sqlScript = + """ +|BEGIN +| DECLARE TEST_CONDITION CONDITION FOR SQLSTATE '12345'; +| SIGNAL TEST_CONDITION; +|END +|""".stripMargin +val exception = intercept[SqlScriptingRuntimeException] { + verifySqlScriptResult(sqlScript, Seq.empty) +} +checkError( Review Comment: posting the error message here as a PR comment is good enough. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
dongjoon-hyun opened a new pull request, #80: URL: https://github.com/apache/spark-docker/pull/80 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50994][CORE] Perform RDD conversion under tracked execution [spark]
cloud-fan commented on code in PR #49678: URL: https://github.com/apache/spark/pull/49678#discussion_r1973281569 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -2721,6 +2721,25 @@ class DataFrameSuite extends QueryTest parameters = Map("name" -> ".whatever") ) } + + test("SPARK-50994: RDD conversion is performed with execution context") { +withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") { Review Comment: I see, let's merge it as it is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50994][CORE] Perform RDD conversion under tracked execution [spark]
cloud-fan closed pull request #49678: [SPARK-50994][CORE] Perform RDD conversion under tracked execution URL: https://github.com/apache/spark/pull/49678 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51336] Upgrade `upload-artifact` to v4 in `spark-docker` repository [spark-docker]
dongjoon-hyun opened a new pull request, #81: URL: https://github.com/apache/spark-docker/pull/81 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51310][SQL] Resolve the type of default string producing expressions [spark]
cloud-fan commented on PR #50053: URL: https://github.com/apache/spark/pull/50053#issuecomment-2687505463 thanks, merging to master/4.0! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51310][SQL] Resolve the type of default string producing expressions [spark]
cloud-fan closed pull request #50053: [SPARK-51310][SQL] Resolve the type of default string producing expressions URL: https://github.com/apache/spark/pull/50053 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51341][CORE] Cancel time task with suitable way. [spark]
beliefer commented on code in PR #50107: URL: https://github.com/apache/spark/pull/50107#discussion_r1973643904 ## core/src/main/scala/org/apache/spark/BarrierTaskContext.scala: ## @@ -300,11 +300,7 @@ object BarrierTaskContext { @Since("2.4.0") def get(): BarrierTaskContext = TaskContext.get().asInstanceOf[BarrierTaskContext] - private val timer = {-val executor = ThreadUtils.newDaemonSingleThreadScheduledExecutor( - "Barrier task timer for barrier() calls.") -assert(executor.isInstanceOf[ScheduledThreadPoolExecutor]) -executor.asInstanceOf[ScheduledThreadPoolExecutor] - } + private val timer = ThreadUtils.newSingleThreadScheduledExecutor( Review Comment: The original code used a non-daemon thread here. Please see https://github.com/apache/spark/commit/5d5b3a54b7b5fb4308fe40da696ba805c72983fc#diff-e9f66d336ded6632366be06dbd70e74afe9595cfc92a962f2c8554f09af53d0fL286 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
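For context on the daemon vs. non-daemon distinction discussed above: a non-daemon scheduler thread keeps the JVM alive until it is shut down explicitly, which is why stopping or cancelling the timer matters. A minimal sketch using plain JDK APIs (not the actual `ThreadUtils` code; paste into a Scala REPL):
```
import java.util.concurrent.{Executors, ThreadFactory}

// Default-factory threads are non-daemon: the JVM will not exit while this
// scheduler is alive, so it must be shut down on stop().
val nonDaemonTimer = Executors.newSingleThreadScheduledExecutor()

// Daemon variant, roughly what a "daemon" thread utility would do:
val daemonTimer = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "barrier-timer")
    t.setDaemon(true) // daemon threads do not block JVM shutdown
    t
  }
})
```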
Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]
jjayadeep06 commented on code in PR #50020: URL: https://github.com/apache/spark/pull/50020#discussion_r1973705329 ## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala: ## @@ -122,23 +124,40 @@ private[spark] class BarrierCoordinator( // Init a TimerTask for a barrier() call. private def initTimerTask(state: ContextBarrierState): Unit = { timerTask = new TimerTask { -override def run(): Unit = state.synchronized { - // Timeout current barrier() call, fail all the sync requests. - requesters.foreach(_.sendFailure(new SparkException("The coordinator didn't get all " + -s"barrier sync requests for barrier epoch $barrierEpoch from $barrierId within " + -s"$timeoutInSecs second(s)."))) - cleanupBarrierStage(barrierId) -} +override def run(): Unit = + try { +state.synchronized { + if (!Thread.currentThread().isInterrupted()) { +// Timeout current barrier() call, fail all the sync requests. +requesters.foreach( + _.sendFailure(new SparkException("The coordinator didn't get all " + +s"barrier sync requests for barrier epoch +" + +s" $barrierEpoch from $barrierId within " + +s"$timeoutInSecs second(s)."))) +cleanupBarrierStage(barrierId) + } +} + } catch { +case _: InterruptedException => + // Handle interruption gracefully + Thread.currentThread().interrupt() +case e: Exception => new SparkException("Error during " + + s"running of barrier tasks for " + + s"$barrierId", e) + } finally { +// Ensure cleanup happens even if interrupted or exception occurs +cleanupBarrierStage(barrierId) +state.clear() + } Review Comment: `cleanupBarrierStage` is called as part of the cleanup in `OnStageCompleted`, and `state.clear` is called [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierCoordinator.scala#L230) as part of `OnStop`, hence I introduced them in the `finally` block so the same cleanup also runs when the task is interrupted or fails. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
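For readers following the interruption handling above, a small self-contained sketch of the pattern under discussion (JDK scheduler APIs only; names and messages are illustrative, not the PR's actual code): cancelling the `ScheduledFuture` with `mayInterruptIfRunning = true` is the `ScheduledThreadPoolExecutor` counterpart of `TimerTask.cancel()`, and a task that catches `InterruptedException` should restore the interrupt flag before its `finally` cleanup runs.
```
import java.util.concurrent.{Executors, TimeUnit}

object BarrierTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val timeoutTask = scheduler.schedule(new Runnable {
      override def run(): Unit = {
        try {
          if (!Thread.currentThread().isInterrupted) {
            println("timeout fired: would fail the pending sync requests here")
          }
        } catch {
          case _: InterruptedException =>
            Thread.currentThread().interrupt() // restore the interrupt status
        } finally {
          println("cleanup runs whether the task completed or was interrupted")
        }
      }
    }, 10, TimeUnit.SECONDS)

    // Counterpart of TimerTask.cancel(): interrupt the task if it is running.
    timeoutTask.cancel(true)
    scheduler.shutdown()
  }
}
```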
Re: [PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
cloud-fan commented on code in PR #50074: URL: https://github.com/apache/spark/pull/50074#discussion_r1973627334 ## sql/core/src/test/resources/sql-tests/inputs/describe.sql: ## @@ -122,6 +122,12 @@ DESC TABLE EXTENDED e; DESC FORMATTED e; +CREATE TABLE f PARTITIONED BY (B, C) AS SELECT 'APACHE' A, CAST('SPARK' AS BINARY) B, TIMESTAMP'2018-11-17 13:33:33' C; Review Comment: Does it create a hive table? Shall we put `USING json` to explicitly use a Spark native data source? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51303] [SQL] [TESTS] Extend `ORDER BY` testing coverage [spark]
mihailoale-db commented on PR #50069: URL: https://github.com/apache/spark/pull/50069#issuecomment-2688377111 @MaxGekk @cloud-fan Failures don't seem related to changes? Please check when you have time. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51336] Upgrade `upload-artifact` to v4 in `spark-docker` repository [spark-docker]
dongjoon-hyun merged PR #81: URL: https://github.com/apache/spark-docker/pull/81 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51336] Upgrade `upload-artifact` to v4 in `spark-docker` repository [spark-docker]
dongjoon-hyun commented on PR #81: URL: https://github.com/apache/spark-docker/pull/81#issuecomment-2688408435 Thank you, @viirya ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51270][SQL] Support UUID type in Variant [spark]
cashmand commented on code in PR #50025: URL: https://github.com/apache/spark/pull/50025#discussion_r1973900282 ## common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java: ## @@ -240,6 +242,19 @@ public void appendBinary(byte[] binary) { writePos += binary.length; } + public void appendUuid(UUID uuid) { +checkCapacity(1 + 16); +writeBuffer[writePos++] = primitiveHeader(UUID); + +// UUID is stored big-endian, so don't use writeLong. +ByteBuffer buffer = ByteBuffer.allocate(16); Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
viirya commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2688494398 This error happened (even after re-triggering):
```
54.80 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
55.22 Segmentation fault (core dumped)
55.27 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
55.64 Segmentation fault (core dumped)
: Sub-process /usr/bin/dpkg returned an error code (1)
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51336] Upgrade `upload-artifact` to v4 in `spark-docker` repository [spark-docker]
LuciferYang commented on PR #81: URL: https://github.com/apache/spark-docker/pull/81#issuecomment-2688468782 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
dongjoon-hyun commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2688513900 Let me hold on this because this is not a blocker For Apache Spark 3.5.5 release announcement. After announcing, I'll come back to this. Thank you, @viirya . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
dongjoon-hyun commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2688511310 Ya, it seems to come from the Qemu environment on GitHub Action. When I build the docker locally, it seems to work. ``` $ docker build . [+] Building 423.4s (12/12) FINISHED docker:desktop-linux => [internal] load build definition from Dockerfile 0.0s => => transferring dockerfile: 3.08kB 0.0s => [internal] load metadata for docker.io/library/eclipse-temurin:17-jammy 1.3s => [auth] library/eclipse-temurin:pull token for registry-1.docker.io 0.0s => [internal] load .dockerignore 0.0s => => transferring context: 2B 0.0s => [1/6] FROM docker.io/library/eclipse-temurin:17-jammy@sha256:1886bea164fd66274feb2031e5a12be3be830949c57491c51cd3d432dafda89f 0.0s => => resolve docker.io/library/eclipse-temurin:17-jammy@sha256:1886bea164fd66274feb2031e5a12be3be830949c57491c51cd3d432dafda89f 0.0s => => sha256:1886bea164fd66274feb2031e5a12be3be830949c57491c51cd3d432dafda89f 6.55kB / 6.55kB 0.0s => => sha256:db3ac3d3f94412d3debc28603f37c00a46e773d619df4e21ef018ab1ce6681cd 1.94kB / 1.94kB 0.0s => => sha256:c1f6b0a95ff9591d42a346fca88e1cd27633e580e7514397e79f3d25598d30ee 6.10kB / 6.10kB 0.0s => [internal] load build context 0.0s => => transferring context: 4.78kB 0.0s => [2/6] RUN groupadd --system --gid=185 spark && useradd --system --uid=185 --gid=spark spark 0.1s => [3/6] RUN set -ex; apt-get update; apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; mkdir -p /opt/spark; mkdir /opt/spark/python; mkdir -p 15.1s => [4/6] RUN set -ex; export SPARK_TMP="$(mktemp -d)"; cd $SPARK_TMP; wget -nv -O spark.tgz "https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz";; wget -nv -O spark.tgz.asc "https://archi 406.1s => [5/6] COPY entrypoint.sh /opt/ 0.0s => [6/6] WORKDIR /opt/spark/work-dir 0.0s => exporting to image
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
viirya commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2688522180 Yes. Looks like an issue from the infrastructure. Thanks @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51335] Publish Apache Spark 3.5.5 to docker registry [spark-docker]
pan3793 commented on PR #80: URL: https://github.com/apache/spark-docker/pull/80#issuecomment-2688551158 this might be a solution https://github.com/apache/iceberg/pull/12262 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49756][SQL][FOLLOWUP] Use correct pgsql datetime fields when pushing down EXTRACT [spark]
beliefer commented on code in PR #50101: URL: https://github.com/apache/spark/pull/50101#discussion_r1974672789 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala: ## @@ -303,12 +303,27 @@ private case class PostgresDialect() class PostgresSQLBuilder extends JDBCSQLBuilder { override def visitExtract(field: String, source: String): String = { - field match { -case "DAY_OF_YEAR" => s"EXTRACT(DOY FROM $source)" -case "YEAR_OF_WEEK" => s"EXTRACT(YEAR FROM $source)" -case "DAY_OF_WEEK" => s"EXTRACT(DOW FROM $source)" -case _ => super.visitExtract(field, source) + // SECOND, MINUTE, HOUR, QUARTER, YEAR, DAY are identical on postgres and spark + // MONTH is different, postgres returns 0-11, spark returns 1-12. + // Postgres also returns 1-12 but just for interval columns, so without source + // data type, we cannot know how to push down it Review Comment:  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
yaooqinn commented on code in PR #50074: URL: https://github.com/apache/spark/pull/50074#discussion_r1974620691 ## sql/core/src/test/resources/sql-tests/inputs/describe.sql: ## @@ -122,6 +122,12 @@ DESC TABLE EXTENDED e; DESC FORMATTED e; +CREATE TABLE f PARTITIONED BY (B, C) AS SELECT 'APACHE' A, CAST('SPARK' AS BINARY) B, TIMESTAMP'2018-11-17 13:33:33' C; Review Comment: Not necessary, w/o the `USING` keyword it defaults to parquet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
yaooqinn commented on code in PR #50074: URL: https://github.com/apache/spark/pull/50074#discussion_r1974624507 ## sql/core/src/test/resources/sql-tests/results/describe.sql.out: ## @@ -890,6 +890,48 @@ a string CONCAT('a\n b\n ', 'c\n d') b int 42 +-- !query +CREATE TABLE f PARTITIONED BY (B, C) AS SELECT 'APACHE' A, CAST('SPARK' AS BINARY) B, TIMESTAMP'2018-11-17 13:33:33' C +-- !query schema +struct<> +-- !query output + + + +-- !query +DESC FORMATTED f PARTITION (B='SPARK', C=TIMESTAMP'2018-11-17 13:33:33') +-- !query schema +struct +-- !query output +A string +B binary +C timestamp +# Partition Information +# col_name data_type comment +B binary +C timestamp + +# Detailed Partition Information +Database default +Table f +Partition Values [B=SPARK, C=2018-11-17 13:33:33] +Location [not included in comparison]/{warehouse_dir}/f/B=SPARK/C=2018-11-17 13%3A33%3A33 +Partition Parameters [numFiles=1, totalSize=486, transient_lastDdlTime=[not included in comparison]] Review Comment: > we have transient_lastDdlTime because it's a hive table? This is not from table parameters, but partition stats, I'm not aware of the history of filtering out some of the hive properties. However, it's not related to this PR. > Maybe we can test with ... AS JSON to avoid seeing these unrelated fields. We already have an `AS JSON` test variant, unfortunately transient_lastDdlTime is still there -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
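For readers wondering about the `13%3A33%3A33` in the partition location quoted above: the stored location URI percent-encodes characters such as `:`, and the PR is about showing a decoded, human-readable form. A minimal sketch of such decoding with plain `java.net` APIs (the warehouse path is illustrative, and this is not necessarily how the PR implements the display change):
```
import java.net.URI

// ':' is stored as %3A and ' ' as %20 in the partition location URI.
val stored = "file:/warehouse/f/B=SPARK/C=2018-11-17%2013%3A33%3A33"
val readable = new URI(stored).getPath
println(readable) // /warehouse/f/B=SPARK/C=2018-11-17 13:33:33
```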
Re: [PR] [SPARK-51339][BUILD] Remove `IllegalImportsChecker` for `s.c.Seq/IndexedSeq` from `scalastyle-config.xml` [spark]
LuciferYang commented on PR #50105: URL: https://github.com/apache/spark/pull/50105#issuecomment-2689564664 Merged into master and branch-4.0. Thanks @HyukjinKwon and @the-sakthi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49756][SQL][FOLLOWUP] Use correct pgsql datetime fields when pushing down EXTRACT [spark]
cloud-fan commented on code in PR #50101: URL: https://github.com/apache/spark/pull/50101#discussion_r1974712264 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala: ## @@ -304,11 +309,25 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCT assert(rows7.length === 1) assert(rows7(0).getString(0) === "alex") -val df8 = sql(s"SELECT name FROM $tbl WHERE dayofweek(date1) = 4") -checkFilterPushed(df8) -val rows8 = df8.collect() -assert(rows8.length === 1) -assert(rows8(0).getString(0) === "alex") +withClue("dayofweek") { + val dow = sql(s"SELECT dayofweek(date1) FROM $tbl WHERE name = 'alex'") +.collect().head.getInt(0) + val df = sql(s"SELECT name FROM $tbl WHERE dayofweek(date1) = $dow") Review Comment: A better idea is probably having a test infra that automatically checks the result between JDBC pushdown on and off. I'll leave it to followups. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51352] Use Spark 3.5.5 in E2E tests [spark-kubernetes-operator]
dongjoon-hyun closed pull request #160: [SPARK-51352] Use Spark 3.5.5 in E2E tests URL: https://github.com/apache/spark-kubernetes-operator/pull/160 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51347] Enable Ingress and Service Support for Spark Driver [spark-kubernetes-operator]
dongjoon-hyun closed pull request #159: [SPARK-51347] Enable Ingress and Service Support for Spark Driver URL: https://github.com/apache/spark-kubernetes-operator/pull/159 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51347] Enable Ingress and Service Support for Spark Driver [spark-kubernetes-operator]
dongjoon-hyun commented on PR #159: URL: https://github.com/apache/spark-kubernetes-operator/pull/159#issuecomment-2689793279 BTW, I noticed that this is the first commit with a different email, @jiangzho . Are you going to use this one?
```
$ git log --author=zh | grep 'Author:' | sort | uniq -c
   1 Author: Zhou JIANG
  20 Author: zhou-jiang
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org