Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984797430 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScanBuilder.scala: ## @@ -25,6 +27,40 @@ class PythonScanBuilder( ds: Python

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984617330 ## python/pyspark/sql/datasource.py: ## @@ -234,6 +249,62 @@ def streamReader(self, schema: StructType) -> "DataSourceStreamReader": ) +ColumnPath = Tup
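For context, the `ColumnPath` alias under discussion refers to a (possibly nested) column by its name parts. A minimal sketch, assuming `ColumnPath` ends up as `Tuple[str, ...]` and an `EqualTo`-style filter class (names here are illustrative, not the final API):

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Assumption: a column reference is a tuple of name parts, one per nesting level.
ColumnPath = Tuple[str, ...]  # ("a",) refers to column a; ("a", "b") to field a.b

@dataclass(frozen=True)
class EqualTo:
    """Illustrative equality filter over a (possibly nested) column."""
    attribute: ColumnPath
    value: Any

# A predicate like `WHERE a.b = 1` would then be represented as:
flt = EqualTo(attribute=("a", "b"), value=1)
print(flt)  # EqualTo(attribute=('a', 'b'), value=1)
```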

Re: [PR] [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries [spark]

2025-03-07 Thread via GitHub
peter-toth commented on code in PR #50208: URL: https://github.com/apache/spark/pull/50208#discussion_r1985025062 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -1037,10 +1034,26 @@ trait CheckAnalysis extends LookupCatalog with

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2706427848 > I think LocalLimit(n)'s purpose is to provide a cheap max n row limiter. We don't have a user-facing API for local limit and local limit is always generated from a global limit,
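To see the relationship being described, a quick PySpark check (assuming a local `SparkSession`) shows that the user-facing `limit(n)` yields a `GlobalLimit` over a `LocalLimit` in the analyzed logical plan; the exact plan text can differ across Spark versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# limit(n) is the only user-facing entry point; the local limit is derived from it.
spark.range(1000).limit(5).explain(extended=True)
# The analyzed logical plan typically contains:
#   GlobalLimit 5
#   +- LocalLimit 5
#      +- Range (0, 1000, step=1, ...)
```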

Re: [PR] [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #50208: URL: https://github.com/apache/spark/pull/50208#discussion_r1985031301 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -1037,10 +1034,26 @@ trait CheckAnalysis extends LookupCatalog with

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49955: URL: https://github.com/apache/spark/pull/49955#discussion_r1985050893 ## sql/core/src/main/scala/org/apache/spark/sql/execution/UnionLoopExec.scala: ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] [MINOR][SQL] Format the SqlBaseParser.g4 [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #49987: URL: https://github.com/apache/spark/pull/49987#issuecomment-2707785792 @dongjoon-hyun Thank you!

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1985908707 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScanBuilder.scala: ## @@ -25,6 +27,40 @@ class PythonScanBuilder( ds: Python

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1985910327 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/UserDefinedPythonDataSource.scala: ## @@ -300,6 +321,94 @@ private class UserDefinedPyt

Re: [PR] [SPARK-45278][YARN] Support executor bind address in Yarn executors [spark]

2025-03-07 Thread via GitHub
github-actions[bot] closed pull request #47892: [SPARK-45278][YARN] Support executor bind address in Yarn executors URL: https://github.com/apache/spark/pull/47892

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1985895545 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScanBuilder.scala: ## @@ -25,6 +27,40 @@ class PythonScanBuilder( ds: Python

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1985905297 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScan.scala: ## @@ -16,26 +16,43 @@ */ package org.apache.spark.sql.execution.d

Re: [PR] [SPARK-51437][CORE] Let timeoutCheckingTask could response thread interrupt [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #50211: URL: https://github.com/apache/spark/pull/50211#issuecomment-2707862706 ping @srowen @dongjoon-hyun @LuciferYang

Re: [PR] [SPARK-51364][SQL][TESTS] Improve the integration tests for external data source by check filter pushed down [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun closed pull request #50126: [SPARK-51364][SQL][TESTS] Improve the integration tests for external data source by check filter pushed down URL: https://github.com/apache/spark/pull/50126

Re: [PR] [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common [spark]

2025-03-07 Thread via GitHub
vrozov commented on code in PR #49971: URL: https://github.com/apache/spark/pull/49971#discussion_r1985642033 ## sql/connect/common/pom.xml: ## @@ -142,8 +218,26 @@ org.spark-project.spark:unused

Re: [PR] [SPARK-51365][SQL][TESTS] Reduce `SHUFFLE_EXCHANGE_MAX_THREAD_THRESHOLD/RESULT_QUERY_STAGE_MAX_THREAD_THRESHOLD` for tests related to `SharedSparkSession/TestHive` when using `macOS + Apple S

2025-03-07 Thread via GitHub
LuciferYang commented on code in PR #50206: URL: https://github.com/apache/spark/pull/50206#discussion_r1985995682 ## .github/workflows/build_and_test.yml: ## @@ -1290,3 +1290,20 @@ jobs: cd ui-test npm install --save-dev node --experimental-vm-m

Re: [PR] [SPARK-51365][SQL][TESTS] Reduce `SHUFFLE_EXCHANGE_MAX_THREAD_THRESHOLD/RESULT_QUERY_STAGE_MAX_THREAD_THRESHOLD` for tests related to `SharedSparkSession/TestHive` when using `macOS + Apple S

2025-03-07 Thread via GitHub
LuciferYang commented on code in PR #50206: URL: https://github.com/apache/spark/pull/50206#discussion_r1985999507 ## .github/workflows/build_maven_java21_macos15.yml: ## @@ -36,5 +36,9 @@ jobs: os: macos-15 envs: >- { - "OBJC_DISABLE_INITIALIZE_F

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2706974909 Thank you for checking, @pan3793 . Are you assuming to rebuild all Hive UDF jars here? I'm wondering if you are presenting the test result with old Hive built-UDF jars here.

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
pan3793 commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2707020792 > Are you assuming to rebuild all Hive UDF jars here? @dongjoon-hyun I never made such an assumption, most of the existing UDFs should work without any change, except to: the UDFs e

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2706978635 BTW, thank you for taking a look at removing this. I support your direction and I hope we can revisit this with you for Apache Spark 4.1.0 timeframe, @pan3793 .

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2707044958 > In short, my conclusion is, we should and must keep all jars required by Hive built-in UDF to allow `o.a.h.hive.ql.exec.FunctionRegistry` initialization, for other jars like commo

Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-07 Thread via GitHub
ahshahid commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1985528366 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3185,16 +3197,106 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSpa

Re: [PR] [SPARK-50763][SQL] Add Analyzer rule for resolving SQL table functions [spark]

2025-03-07 Thread via GitHub
allisonwang-db commented on code in PR #49471: URL: https://github.com/apache/spark/pull/49471#discussion_r1985588345 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -1675,6 +1676,86 @@ class SessionCatalog( } } + /** +

Re: [PR] [SPARK-50763][SQL] Add Analyzer rule for resolving SQL table functions [spark]

2025-03-07 Thread via GitHub
allisonwang-db commented on code in PR #49471: URL: https://github.com/apache/spark/pull/49471#discussion_r1985591167 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -1675,6 +1676,86 @@ class SessionCatalog( } } + /** +

Re: [PR] [MINOR][CORE] Remove redundant synchronized in ThreadUtils [spark]

2025-03-07 Thread via GitHub
jinkachy commented on PR #50210: URL: https://github.com/apache/spark/pull/50210#issuecomment-2707270252 > Please config github action. done

Re: [PR] [MINOR][SQL] Improve readability of JDBC truncate table condition check [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50207: URL: https://github.com/apache/spark/pull/50207#discussion_r1985965363 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala: ## @@ -54,7 +54,7 @@ class JdbcRelationProvider extends Creatabl

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985965443 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala: ## @@ -63,7 +63,7 @@ class Executor

Re: [PR] [SPARK-50639][SQL] Improve warning logging in CacheManager [spark]

2025-03-07 Thread via GitHub
vrozov commented on PR #49276: URL: https://github.com/apache/spark/pull/49276#issuecomment-2704365043 @hvanhovell ? @dongjoon-hyun ?

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985966155 ## sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala: ## @@ -108,7 +108,7 @@ trait BroadcastExchangeLike extends Exchange {

[PR] [MINOR][SPARK-SQL] Improve readability of JDBC truncate table conditi… [spark]

2025-03-07 Thread via GitHub
jinkachy opened a new pull request, #50207: URL: https://github.com/apache/spark/pull/50207 ### What changes were proposed in this pull request? This PR improves the readability of the JDBC truncate table condition check by clarifying the logic flow in the code. The commit rewrites th

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
Pajaraja commented on code in PR #49955: URL: https://github.com/apache/spark/pull/49955#discussion_r1985302503 ## sql/core/src/main/scala/org/apache/spark/sql/execution/UnionLoopExec.scala: ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] [SPARK-47849][PYTHON][CONNECT] Change release script to release pyspark-client [spark]

2025-03-07 Thread via GitHub
HyukjinKwon commented on PR #50203: URL: https://github.com/apache/spark/pull/50203#issuecomment-2705626161 Merged to master and branch-4.0.

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984790606 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/UserDefinedPythonDataSource.scala: ## @@ -300,6 +321,94 @@ private class UserDefinedPyt

Re: [PR] [SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case [spark]

2025-03-07 Thread via GitHub
the-sakthi commented on PR #50086: URL: https://github.com/apache/spark/pull/50086#issuecomment-2704944701 I noticed a very minor grammatical nit here, apologies for the oversight. Have created a PR to quickly address that: https://github.com/apache/spark/pull/50196 @itholic

[PR] [SPARK-47849][PYTHON][CONNECT] Change release script to release pyspark-connect [spark]

2025-03-07 Thread via GitHub
HyukjinKwon opened a new pull request, #50203: URL: https://github.com/apache/spark/pull/50203 ### What changes were proposed in this pull request? This PR proposes to change release script to publish `pyspark-client`. ### Why are the changes needed? We should have the re

Re: [PR] [SPARK-51384][SQL] Support `java.time.LocalTime` as the external type of `TimeType` [spark]

2025-03-07 Thread via GitHub
MaxGekk commented on PR #50153: URL: https://github.com/apache/spark/pull/50153#issuecomment-2702838681 Merging to master. Thank you, @yaooqinn @dongjoon-hyun for review.

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984839577 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScanBuilder.scala: ## @@ -25,6 +27,40 @@ class PythonScanBuilder( ds: Python

[PR] Revert "[SPARK-51396][SQL] RuntimeConfig.getOption shouldn't use exceptions for control flow" [spark]

2025-03-07 Thread via GitHub
JoshRosen opened a new pull request, #50200: URL: https://github.com/apache/spark/pull/50200 ### What changes were proposed in this pull request? This reverts commit db06293dd100b4f2a4efe3e7624a9be2345e6575 / https://github.com/apache/spark/pull/50167. That PR introduced a subt

[PR] [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries [spark]

2025-03-07 Thread via GitHub
Pajaraja opened a new pull request, #50208: URL: https://github.com/apache/spark/pull/50208 ### What changes were proposed in this pull request? Change the place where we check whether there is a recursive CTE within a subquery. Also, change implementation to be instead of collecting

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2706311534 + A side note, for those usecases where the `UnionLoop` is infinite we should probaby introduce a config similar to `spark.sql.cteRecursionLevelLimit`, but limit the number of rows ret
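For reference, the existing depth guard and the suggested row-based counterpart would look roughly like this (assuming an active `SparkSession` named `spark`; the second key is hypothetical, named only to illustrate the suggestion):

```python
# Existing recursion-depth guard referenced in the comment above.
spark.conf.set("spark.sql.cteRecursionLevelLimit", 100)

# Hypothetical row-based guard sketched from the suggestion; not an actual config key.
spark.conf.set("spark.sql.cteRecursionRowLimit", 1_000_000)
```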

[PR] [MINOR][SPARK-CORE] Remove redundant synchronized in ThreadUtils [spark]

2025-03-07 Thread via GitHub
jinkachy opened a new pull request, #50210: URL: https://github.com/apache/spark/pull/50210 ### What changes were proposed in this pull request? This PR removes the redundant `synchronized` keyword from the `isTerminated` method in `sameThreadExecutorService()` imp

Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #50109: URL: https://github.com/apache/spark/pull/50109#discussion_r1984471179 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala: ## @@ -651,5 +651,4 @@ class InMemoryCatalog( requireDbExists(db)

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2706392622 > This makes sense, but I wonder how would we tell apart the cases where it's an infinite recursion, and we're returning the first k (k modifiable in flag) results vs a finite (but ver

Re: [PR] [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on PR #50208: URL: https://github.com/apache/spark/pull/50208#issuecomment-2706396016 thanks, merging to master/4.0!

Re: [PR] [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries [spark]

2025-03-07 Thread via GitHub
cloud-fan closed pull request #50208: [MINOR][SQL] Slightly refactor and optimize illegaility check in Recursive CTE Subqueries URL: https://github.com/apache/spark/pull/50208

[PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-07 Thread via GitHub
vrozov opened a new pull request, #50213: URL: https://github.com/apache/spark/pull/50213 ### What changes were proposed in this pull request? Upgrade Hive compile time dependency to 4.0.1 ### Why are the changes needed? Apache Hive 1.x, 2.x and 3.x are EOL ### Doe

Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-07 Thread via GitHub
attilapiros commented on PR #50122: URL: https://github.com/apache/spark/pull/50122#issuecomment-2706931664 cc @gengliangwang

Re: [PR] [SPARK-51182][SQL] DataFrameWriter should throw dataPathNotSpecifiedError when path is not specified [spark]

2025-03-07 Thread via GitHub
vrozov commented on code in PR #49928: URL: https://github.com/apache/spark/pull/49928#discussion_r1985384660 ## sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameReaderWriterSuite.java: ## @@ -152,4 +159,16 @@ public void testOrcAPI() { spark.read().schema(sche

Re: [PR] [SPARK-51425][Connect] Add client API to set custom `operation_id` [spark]

2025-03-07 Thread via GitHub
vicennial commented on PR #50191: URL: https://github.com/apache/spark/pull/50191#issuecomment-2706706239 Thanks for the review! CI is green after some lint changes :)

[PR] [SPARK-51438][SQL] Make CatalystDataToProtobuf and ProtobufDataToCatalyst properly comparable and hashable [spark]

2025-03-07 Thread via GitHub
vladimirg-db opened a new pull request, #50212: URL: https://github.com/apache/spark/pull/50212 ### What changes were proposed in this pull request? Hand-roll `equals` and `hashCode` for `CatalystDataToProtobuf` and `ProtobufDataToCatalyst`. ### Why are the changes needed?

Re: [PR] [MINOR][SPARK-CORE] Remove redundant synchronized in ThreadUtils [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50210: URL: https://github.com/apache/spark/pull/50210#discussion_r1985118560 ## core/src/main/scala/org/apache/spark/util/ThreadUtils.scala: ## @@ -65,7 +65,7 @@ private[spark] object ThreadUtils { } } -override def isTerminat

Re: [PR] [SPARK-51365][SQL][TESTS] Reduce `SHUFFLE_EXCHANGE_MAX_THREAD_THRESHOLD/RESULT_QUERY_STAGE_MAX_THREAD_THRESHOLD` for tests related to `SharedSparkSession/TestHive` when using `macOS + Apple S

2025-03-07 Thread via GitHub
LuciferYang commented on code in PR #50206: URL: https://github.com/apache/spark/pull/50206#discussion_r1985261482 ## sql/core/src/test/scala/org/apache/spark/sql/test/SharedSparkSession.scala: ## @@ -79,6 +80,15 @@ trait SharedSparkSessionBase StaticSQLConf.WAREHOUSE_PAT

[PR] [WIP][SPARK-51437][CORE] Let timeoutCheckingTask could response thread interrupt [spark]

2025-03-07 Thread via GitHub
beliefer opened a new pull request, #50211: URL: https://github.com/apache/spark/pull/50211 ### What changes were proposed in this pull request? This PR proposes to let `timeoutCheckingTask` could response thread interrupt. ### Why are the changes needed? Currently, we cance

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
Pajaraja commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2706383952 > A side note, for those usecases where the `UnionLoop` is infinite we should probaby introduce a config similar to `spark.sql.cteRecursionLevelLimit`, but to limit the number of rows re

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49955: URL: https://github.com/apache/spark/pull/49955#discussion_r1985063422 ## sql/core/src/main/scala/org/apache/spark/sql/execution/UnionLoopExec.scala: ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
Pajaraja commented on code in PR #49955: URL: https://github.com/apache/spark/pull/49955#discussion_r1985300518 ## sql/core/src/main/scala/org/apache/spark/sql/execution/UnionLoopExec.scala: ## @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] [SPARK-51366][SQL] Add a new visitCaseWhen method to V2ExpressionSQLBuilder [spark]

2025-03-07 Thread via GitHub
beliefer closed pull request #50129: [SPARK-51366][SQL] Add a new visitCaseWhen method to V2ExpressionSQLBuilder URL: https://github.com/apache/spark/pull/50129

Re: [PR] [SPARK-50992][SQL] OOMs and performance issues with AQE in large plans [spark]

2025-03-07 Thread via GitHub
JackBuggins commented on PR #49724: URL: https://github.com/apache/spark/pull/49724#issuecomment-2706585219 Strongly agree with @SauronShepherd, many will have workflows where the final plan on the UI is not critical, many opt to debug and understand plans via explain. Off and Off wi

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2706641950 > However, we may push down local limit without global limit and at the end they can be very far away. I think we disagree a bit here. While the above is true, a `LocalLimit(n)`

Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #50109: URL: https://github.com/apache/spark/pull/50109#discussion_r1984475040 ## sql/core/src/test/scala/org/apache/spark/sql/connector/ProcedureSuite.scala: ## @@ -40,15 +40,23 @@ class ProcedureSuite extends QueryTest with SharedSparkSession

[PR] [WIP][SPARK-51436][SQL] Change the mayInterruptIfRunning from true to false [spark]

2025-03-07 Thread via GitHub
beliefer opened a new pull request, #50209: URL: https://github.com/apache/spark/pull/50209 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? 'No'. ### How was t

Re: [PR] [SPARK-51029][BUILD] Remove `hive-llap-common` compile dependency [spark]

2025-03-07 Thread via GitHub
pan3793 commented on PR #49725: URL: https://github.com/apache/spark/pull/49725#issuecomment-2706298521 Sorry, I can't follow the decision of removing `hive-llap-common-2.3.10.jar` from Spark dist, it technically breaks the feature "Support Hive UDF", without this jar, Spark is not able to

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49955: URL: https://github.com/apache/spark/pull/49955#discussion_r1985046279 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/cteOperators.scala: ## @@ -40,7 +40,8 @@ case class UnionLoop( id: Long, anchor:

Re: [PR] [MINOR][SPARK-CORE] Remove redundant synchronized in ThreadUtils [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #50210: URL: https://github.com/apache/spark/pull/50210#issuecomment-2706543971 Please config github action.

Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-07 Thread via GitHub
attilapiros commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1985391930 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3185,16 +3197,106 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocal

Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-07 Thread via GitHub
attilapiros commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1985391930 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3185,16 +3197,106 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocal

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
pan3793 commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2706965438 > For this one PR, I believe we need a verification for different HMS versions to make it sure. @dongjoon-hyun I managed to set up an env to test the IsolatedClassLoader, it works

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on PR #49961: URL: https://github.com/apache/spark/pull/49961#issuecomment-2705841733 For DS v2, the scan workflow is: 1. analyzer gets `Table` from `TableProvider`, and puts it in `DataSourceV2Relation` (for batch scan) or `StreamingRelationV2` (for streaming scan).
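A rough Python-side sketch of where filter pushdown could hook into that workflow, using the existing `DataSource`/`DataSourceReader` classes; the `pushFilters` name and its "return the filters you could not handle" contract are taken from this PR discussion and may change:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my_source"

    def schema(self):
        return "id INT, category STRING"

    def reader(self, schema):
        return MyReader()

class MyReader(DataSourceReader):
    def __init__(self):
        self._pushed = []

    def pushFilters(self, filters):
        # Assumed hook added by this PR: keep the filters this source can
        # evaluate, hand the unsupported ones back to Spark.
        unsupported = []
        for f in filters:
            if type(f).__name__ == "EqualTo":
                self._pushed.append(f)
            else:
                unsupported.append(f)
        return unsupported

    def read(self, partition):
        # A real reader would apply self._pushed while producing rows.
        yield (1, "a")
```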

Re: [PR] [SPARK-51422][ML][PYTHON] Eliminate the JVM-Python data exchange in CrossValidator [spark]

2025-03-07 Thread via GitHub
zhengruifeng closed pull request #50184: [SPARK-51422][ML][PYTHON] Eliminate the JVM-Python data exchange in CrossValidator URL: https://github.com/apache/spark/pull/50184

[PR] fix: handle compare_vals turn into str when parse IgnoreColumnType [spark]

2025-03-07 Thread via GitHub
phakawatfong opened a new pull request, #50205: URL: https://github.com/apache/spark/pull/50205 When I pass ignoreColumnType=True to the function, it casts all values to 'str', which causes the comparison between val1 and val2 to fail. For example, val1 = 1505.761895
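The failure mode being described, in isolation (values are illustrative): once both sides are coerced to `str`, numerically equal values can stop comparing equal:

```python
val1 = 1505.761895        # numeric value on one side of the comparison
val2 = "1505.7618950"     # the same value after being cast to str elsewhere

print(float(val2) == val1)  # True  -> numeric comparison succeeds
print(val2 == str(val1))    # False -> string comparison fails
```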

Re: [PR] [SPARK-51365][TESTS] Test maven + macos [spark]

2025-03-07 Thread via GitHub
LuciferYang commented on PR #50178: URL: https://github.com/apache/spark/pull/50178#issuecomment-2705877001 > The issue should be resolvable. The commit history of this pr is a bit messy. I'll submit a clean one and add more descriptions tomorrow a clean one: https://github.com/apache

Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50020: URL: https://github.com/apache/spark/pull/50020#discussion_r1984404217 ## core/src/main/scala/org/apache/spark/BarrierCoordinator.scala: ## @@ -132,13 +136,15 @@ private[spark] class BarrierCoordinator( } } -// Cancel th

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984616581 ## python/pyspark/errors/error-conditions.json: ## @@ -189,11 +189,21 @@ "Remote client cannot create a SparkContext. Create SparkSession instead." ]

Re: [PR] [SPARK-51422][ML][PYTHON] Eliminate the JVM-Python data exchange in CrossValidator [spark]

2025-03-07 Thread via GitHub
zhengruifeng commented on PR #50184: URL: https://github.com/apache/spark/pull/50184#issuecomment-2705798151 merged to master

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984636973 ## python/pyspark/sql/datasource.py: ## @@ -234,6 +249,62 @@ def streamReader(self, schema: StructType) -> "DataSourceStreamReader": ) +ColumnPath = Tuple[s

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984638659 ## python/pyspark/errors/error-conditions.json: ## @@ -189,11 +189,21 @@ "Remote client cannot create a SparkContext. Create SparkSession instead." ] }, +

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984640102 ## python/pyspark/sql/datasource.py: ## @@ -234,6 +249,62 @@ def streamReader(self, schema: StructType) -> "DataSourceStreamReader": ) +ColumnPath = Tuple[s

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
wengh commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984636973 ## python/pyspark/sql/datasource.py: ## @@ -234,6 +249,62 @@ def streamReader(self, schema: StructType) -> "DataSourceStreamReader": ) +ColumnPath = Tuple[s

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984615352 ## python/pyspark/errors/error-conditions.json: ## @@ -189,11 +189,21 @@ "Remote client cannot create a SparkContext. Create SparkSession instead." ]

Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #50020: URL: https://github.com/apache/spark/pull/50020#issuecomment-2705779882 Merged into branch-4.0/master @jjayadeep06 @srowen @jayadeep-jayaraman Thank you!

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
cloud-fan commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1984618409 ## python/pyspark/sql/datasource.py: ## @@ -234,6 +249,62 @@ def streamReader(self, schema: StructType) -> "DataSourceStreamReader": ) +ColumnPath = Tup

Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #50020: URL: https://github.com/apache/spark/pull/50020#issuecomment-2705780937 @jjayadeep06 Could you create a backport PR for branch-3.5 ?

Re: [PR] [SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

2025-03-07 Thread via GitHub
peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2705956689 > Ideally recursive CTE should stop if the last iteration generates no data. Pushing down the LIMIT and applying an early stop is an optimization and should not change the query result
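The kind of query at stake, sketched with the `WITH RECURSIVE` syntax this work introduces (assuming an active `SparkSession` named `spark`; syntax and behavior may still evolve). Without the outer LIMIT the loop below never produces an empty iteration, so either pushing the limit into the loop or hitting the recursion guard is what ends it:

```python
df = spark.sql("""
    WITH RECURSIVE r(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM r   -- no stop condition: relies on the limit/guard
    )
    SELECT * FROM r LIMIT 10
""")
df.show()
```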

Re: [PR] [DRAFT][SQL] Old collation resolution PR [spark]

2025-03-07 Thread via GitHub
github-actions[bot] closed pull request #48844: [DRAFT][SQL] Old collation resolution PR URL: https://github.com/apache/spark/pull/48844

Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-07 Thread via GitHub
attilapiros commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1985859145 ## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: ## @@ -1898,24 +1898,34 @@ private[spark] class DAGScheduler( // Make sure the task's acc

Re: [PR] corrected link to mllib-guide.md [spark]

2025-03-07 Thread via GitHub
github-actions[bot] closed pull request #48968: corrected link to mllib-guide.md URL: https://github.com/apache/spark/pull/48968

Re: [PR] [SPARK-51271][PYTHON] Add filter pushdown API to Python Data Sources [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #49961: URL: https://github.com/apache/spark/pull/49961#discussion_r1985896731 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonScanBuilder.scala: ## @@ -25,6 +27,40 @@ class PythonScanBuilder( ds: Python

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985954100 ## core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala: ## @@ -250,7 +250,7 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) ov

Re: [PR] [SPARK-51437][CORE] Let timeoutCheckingTask could response thread interrupt [spark]

2025-03-07 Thread via GitHub
beliefer commented on PR #50211: URL: https://github.com/apache/spark/pull/50211#issuecomment-2707940074 I make a mistake.

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985957890 ## core/src/main/scala/org/apache/spark/deploy/client/StandaloneAppClient.scala: ## @@ -277,7 +277,7 @@ private[spark] class StandaloneAppClient( override def o

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985958618 ## core/src/main/scala/org/apache/spark/deploy/master/Master.scala: ## @@ -214,10 +214,10 @@ private[deploy] class Master( applicationMetricsSystem.report()

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985958868 ## core/src/main/scala/org/apache/spark/deploy/master/Master.scala: ## @@ -214,10 +214,10 @@ private[deploy] class Master( applicationMetricsSystem.report()

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985959471 ## core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala: ## @@ -403,7 +403,7 @@ private[deploy] class Worker( // We have exceeded the initial regis

Re: [PR] [WIP][SPARK-51436][CORE][SQL][K8s][SS] Fix bug that cancel Future specified mayInterruptIfRunning with true [spark]

2025-03-07 Thread via GitHub
beliefer commented on code in PR #50209: URL: https://github.com/apache/spark/pull/50209#discussion_r1985955247 ## connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/FetchedDataPool.scala: ## @@ -139,7 +139,7 @@ private[consumer] class FetchedDataPool

Re: [PR] [SPARK-51437][CORE] Let timeoutCheckingTask could response thread interrupt [spark]

2025-03-07 Thread via GitHub
beliefer closed pull request #50211: [SPARK-51437][CORE] Let timeoutCheckingTask could response thread interrupt URL: https://github.com/apache/spark/pull/50211

Re: [PR] [SPARK-51298][SQL] Support variant in CSV scan [spark]

2025-03-07 Thread via GitHub
sandip-db commented on code in PR #50052: URL: https://github.com/apache/spark/pull/50052#discussion_r1972960791 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala: ## @@ -68,6 +69,11 @@ class CSVFileFormat extends TextBasedFileFormat w

Re: [PR] [MINOR][SQL] Improve readability of JDBC truncate table condition check [spark]

2025-03-07 Thread via GitHub
jinkachy closed pull request #50207: [MINOR][SQL] Improve readability of JDBC truncate table condition check URL: https://github.com/apache/spark/pull/50207

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
pan3793 commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2707065925 > > For this one PR, I believe we need a verification for different HMS versions to make it sure. > > that's a valid concern, since Spark CI only covers embedded HMS client case, l

Re: [PR] [SPARK-48231][BUILD] Remove unused CodeHaus Jackson dependencies [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2707102450 Thank you, @pan3793 ! And, sorry for your inconvenience.

Re: [PR] [MINOR][CORE] Remove redundant synchronized in ThreadUtils [spark]

2025-03-07 Thread via GitHub
dongjoon-hyun commented on PR #50210: URL: https://github.com/apache/spark/pull/50210#issuecomment-2707360419 Merged to master. Thank you, @jinkachy and all.
