Re: [PR] [SPARK-51490] Support `iOS`, `watchOS`, and `tvOS` [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #13:
URL: https://github.com/apache/spark-connect-swift/pull/13#issuecomment-2718747123

   Thank you so much, @huaxingao! Merged to main.



Re: [PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #50249: [SPARK-51484][SS] Remove unused 
function `private def newDFSFileName(String)` from `RocksDBFileManager`
URL: https://github.com/apache/spark/pull/50249



Re: [PR] [SPARK-51488][SQL] Support the TIME keyword as a data type [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50250:
URL: https://github.com/apache/spark/pull/50250#issuecomment-2718780657

   Merged to master. Thank you, @MaxGekk.



Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #50232: [SPARK-51466][SQL][HIVE] Eliminate 
Hive built-in UDFs initialization on Hive UDF evaluation
URL: https://github.com/apache/spark/pull/50232



Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50232:
URL: https://github.com/apache/spark/pull/50232#issuecomment-2718785462

   Thank you, @pan3793 and @LuciferYang. Merged to master.
   
   Could you make a backport PR to branch-4.0 to run CI there once more, @pan3793?



Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2718534144

   
   Will `dev/create-release/spark-rm/Dockerfile` be fixed in a separate pull request?
   
   



Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #50122: [SPARK-43221][CORE] Host local block 
fetching should use a block status of a block stored on disk
URL: https://github.com/apache/spark/pull/50122



Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50122:
URL: https://github.com/apache/spark/pull/50122#issuecomment-2718823408

   Thank you, @attilapiros and @Ngone51.
   
   BTW, could you make two backport PRs to branch-4.0 and branch-3.5 to make sure all CIs pass there, @attilapiros?



Re: [PR] [SPARK-51490] Support `iOS`, `watchOS`, and `tvOS` [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #13: [SPARK-51490] Support `iOS`, `watchOS`, 
and `tvOS`
URL: https://github.com/apache/spark-connect-swift/pull/13



Re: [PR] [SPARK-51488][SQL] Support the TIME keyword as a data type [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #50250: [SPARK-51488][SQL] Support the TIME 
keyword as a data type
URL: https://github.com/apache/spark/pull/50250



Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50122:
URL: https://github.com/apache/spark/pull/50122#issuecomment-2718824584

   Also cc @mridulm.



Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


deniskuzZ commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1991644845


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:
##
@@ -1210,7 +1210,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
       .zip(parts)
       .map { case (k, v) =>
         if (v == "NULL") {
-          s"$k=${ConfVars.DEFAULTPARTITIONNAME.defaultStrVal}"
+          s"$k=${ConfVars.DEFAULTPARTITIONNAME.getDefaultVal}"

Review Comment:
   I think you should use `org.apache.hadoop.hive.conf.HiveConf.ConfVars.DEFAULT_PARTITION_NAME`




[PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z`… [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun opened a new pull request, #14:
URL: https://github.com/apache/spark-connect-swift/pull/14

   … version
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   



[PR] [SPARK-51492][SS]FileStreamSource: Avoid expensive file concatenation if not needed [spark]

2025-03-12 Thread via GitHub


siying opened a new pull request, #50257:
URL: https://github.com/apache/spark/pull/50257

   ### What changes were proposed in this pull request?
   In two places where constructing a log line is expensive (very expensive in one place, mildly in the other), avoid the logTrace() call altogether.
   
   
   ### Why are the changes needed?
   In this statement:
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L402
   
   `files.mkString("\n\t")` can be very expensive when there are many files; it could even cause an OOM.
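   
   As a minimal sketch of the general pattern (not the PR's exact diff), assuming a plain SLF4J logger, the guard keeps the expensive string from ever being built unless TRACE is active:
   
   ```scala
   import org.slf4j.LoggerFactory
   
   object TraceLogGuard {
     private val log = LoggerFactory.getLogger(getClass)
   
     def logBatch(files: Seq[String]): Unit = {
       // files.mkString allocates a string proportional to the number of files,
       // so build it only when TRACE is actually enabled.
       if (log.isTraceEnabled) {
         log.trace(s"Batch files:\n\t${files.mkString("\n\t")}")
       }
     }
   }
   ```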
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Use existing CI
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   




Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #14:
URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718898426

   Could you review this PR when you have some time, @attilapiros?



[PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


zhengruifeng opened a new pull request, #50255:
URL: https://github.com/apache/spark/pull/50255

   
   
   ### What changes were proposed in this pull request?
   Refresh testing images for pyarrow 19
   
   
   ### Why are the changes needed?
   To test against the latest pyarrow.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   CI:
   
   - `lint` and `python-311` will be tested in the PR builder
   - for other images, we will monitor the daily builders
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.



Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


deniskuzZ commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1991860382


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
   // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
   // and the CommandProcessorFactory.clean function removed.
   driver.getClass.getMethod("close").invoke(driver)
-  if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {

Review Comment:
   Hive 2.x is ancient and should no longer be supported.




Re: [PR] [SPARK-51338][INFRA] Add automated CI build for `connect-examples` [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on code in PR #50187:
URL: https://github.com/apache/spark/pull/50187#discussion_r1991901327


##
connect-examples/server-library-example/pom.xml:
##
@@ -36,7 +36,8 @@
 UTF-8
 2.13
 2.13.15
-3.25.4
-4.0.0-preview2
+4.29.3
+4.1.0-SNAPSHOT
+33.4.0-jre

Review Comment:
   If feasible, it's certainly OK. However, I have a few questions regarding this:
   
   1. How should the version numbers of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.
   
   2. During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the `build` of this project?
   
   3. Since it aims to be an independent project, why don't we maintain this examples project in a separate branch (with no Spark code whatsoever), or even create a separate repository like `spark-connect-examples`? If it were an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?




Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]

2025-03-12 Thread via GitHub


HyukjinKwon closed pull request #50254: [MINOR][PYTHON] Generate with a newline 
at the end of error-conditions.json
URL: https://github.com/apache/spark/pull/50254



[PR] [MINOR] Add compatibility badges [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun opened a new pull request, #12:
URL: https://github.com/apache/spark-connect-swift/pull/12

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   



Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]

2025-03-12 Thread via GitHub


HyukjinKwon commented on PR #50254:
URL: https://github.com/apache/spark/pull/50254#issuecomment-2718496912

   Merged to master.



Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]

2025-03-12 Thread via GitHub


wForget commented on PR #50245:
URL: https://github.com/apache/spark/pull/50245#issuecomment-2716878368

   @viirya @dongjoon-hyun @kazuyukitanimura could you please take a look? 



Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on code in PR #50255:
URL: https://github.com/apache/spark/pull/50255#discussion_r1991919622


##
dev/spark-test-image/python-313/Dockerfile:
##
@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
 && rm -rf /var/lib/apt/lists/*
 
 
-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment:
   And
   
   https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/infra/Dockerfile#L97
   
   https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/infra/Dockerfile#L151




Re: [PR] [SPARK-51472] Add gRPC `SparkConnectClient` actor [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #7: [SPARK-51472] Add gRPC 
`SparkConnectClient` actor
URL: https://github.com/apache/spark-connect-swift/pull/7



Re: [PR] [SPARK-51481] Add `RuntimeConf` actor [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #9:
URL: https://github.com/apache/spark-connect-swift/pull/9#issuecomment-2716593313

   Please wait for a while until I finish the first migration. `4.1.0` will be replaced properly in a week.



Re: [PR] [MINOR] Add compatibility badges [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #12:
URL: https://github.com/apache/spark-connect-swift/pull/12#issuecomment-2718610578

   Thank you! Merged to main.



Re: [PR] [MINOR] Add compatibility badges [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #12: [MINOR] Add compatibility badges
URL: https://github.com/apache/spark-connect-swift/pull/12



Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-12 Thread via GitHub


zhengruifeng commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1991972859


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ShowProceduresCommand.scala:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.types.StringType
+
+/**
+ * A command for users to get procedures.
+ * If a namespace is not given, the current namespace will be used.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW TABLES [(IN|FROM) namespace]]

Review Comment:
   ```suggestion
*   SHOW PROCEDURES [(IN|FROM) namespace]]
   ```



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ShowProceduresCommand.scala:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.types.StringType
+
+/**
+ * A command for users to get procedures.
+ * If a namespace is not given, the current namespace will be used.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW TABLES [(IN|FROM) namespace]]
+ * }}}
+ */
+case class ShowProceduresCommand(
+child: LogicalPlan,
+override val output: Seq[Attribute] = Seq(
+  AttributeReference("namespace", StringType, nullable = false)(),
+  AttributeReference("procedure_name", StringType, nullable = false)()
+)) extends UnaryRunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+child match {
+  case ResolvedNamespace(catalog, ns, _) =>
+val procedureCatalog = catalog.asProcedureCatalog
+val procedures = procedureCatalog.listProcedures(ns.toArray)
+procedures.toSeq.map(p => Row(p.namespace().quoted, p.name()))

Review Comment:
   If there is a mismatch, do we throw a proper exception rather than `scala.MatchError`?
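   
   A minimal sketch of the defensive match being hinted at, assuming `org.apache.spark.SparkException.internalError` as the error helper (the actual fix may choose a different error class):
   
   ```scala
   override def run(sparkSession: SparkSession): Seq[Row] = {
     child match {
       case ResolvedNamespace(catalog, ns, _) =>
         val procedureCatalog = catalog.asProcedureCatalog
         procedureCatalog.listProcedures(ns.toArray).toSeq
           .map(p => Row(p.namespace().quoted, p.name()))
       case other =>
         // Fail with a descriptive internal error instead of scala.MatchError.
         throw SparkException.internalError(
           s"Unexpected plan for SHOW PROCEDURES: ${other.getClass.getName}")
     }
   }
   ```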



Re: [PR] [SPARK-51481] Add `RuntimeConf` actor [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #9:
URL: https://github.com/apache/spark-connect-swift/pull/9#issuecomment-2718880137

   To @yaooqinn, I applied the new `version` scheme to all.
   > By the way, can we unbind the version tracker of this project from the Spark main repo? It might be inconvenient for the RM of v4.1.0 to generate the release notes.
   
   ![Screenshot 2025-03-12 at 12 20 00](https://github.com/user-attachments/assets/d24fb40b-bc6c-4c82-a290-8d20ba436ea4)
   



[PR] [SPARK-51491][PYTHON] Simplify boxplot with subquery APIs [spark]

2025-03-12 Thread via GitHub


zhengruifeng opened a new pull request, #50258:
URL: https://github.com/apache/spark/pull/50258

   
   
   ### What changes were proposed in this pull request?
   Simplify boxplot with subquery APIs.
   
   ### Why are the changes needed?
   1. To make the code simpler.
   2. In my experience, a subquery is faster than a join in this case (see the toy sketch below).
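   
   As a toy illustration of the tradeoff (not the PR's code; the table and column names are made up, and `spark` is an active session):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   spark.range(100).selectExpr("id % 10 AS k", "id AS v").createOrReplaceTempView("t")
   
   // Join-based: aggregate once, then join the single-row result back in.
   val viaJoin = spark.sql(
     """SELECT t.k, t.v FROM t
       |JOIN (SELECT avg(v) AS m FROM t) s
       |ON t.v > s.m""".stripMargin)
   
   // Subquery-based: the same filter with a scalar subquery and no extra join.
   val viaSubquery = spark.sql("SELECT k, v FROM t WHERE v > (SELECT avg(v) FROM t)")
   ```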
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   ci
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no



Re: [PR] [SPARK-51473][ML][CONNECT] ML transformed dataframe keep a reference to the model [spark]

2025-03-12 Thread via GitHub


zhengruifeng commented on code in PR #50199:
URL: https://github.com/apache/spark/pull/50199#discussion_r1992191679


##
python/pyspark/ml/util.py:
##
@@ -185,29 +185,40 @@ def wrapped(self: "JavaWrapper", dataset: "ConnectDataFrame") -> Any:
 
 assert isinstance(self._java_obj, str)
 params = serialize_ml_params(self, session.client)
-return ConnectDataFrame(
-TransformerRelation(
-child=dataset._plan, name=self._java_obj, 
ml_params=params, is_model=True
-),
-session,
+plan = TransformerRelation(
+child=dataset._plan,
+name=self._java_obj,
+ml_params=params,
+is_model=True,
 )
 elif isinstance(self, Transformer):
 from pyspark.ml.connect.proto import TransformerRelation
 
 assert isinstance(self._java_obj, str)
 params = serialize_ml_params(self, session.client)
-return ConnectDataFrame(
-TransformerRelation(
-child=dataset._plan,
-name=self._java_obj,
-ml_params=params,
-uid=self.uid,
-is_model=False,
-),
-session,
+plan = TransformerRelation(

Review Comment:
   `ConnectDataFrame` is used to build the Connect DataFrame; it is still used at the end of this method.




Re: [PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50249:
URL: https://github.com/apache/spark/pull/50249#issuecomment-2718779093

   Merged to master. Thank you, @LuciferYang and @beliefer.



Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


attilapiros commented on PR #14:
URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718970521

   I see this connect-swift-x.y.z is already used in the tickets:
   https://github.com/user-attachments/assets/6f7088a6-4f08-4a2a-b44e-d7d30be79eff
   



Re: [PR] [SPARK-51474][SQL] Don't insert redundant ColumnarToRowExec for node supporting both columnar and row output [spark]

2025-03-12 Thread via GitHub


viirya closed pull request #50239: [SPARK-51474][SQL] Don't insert redundant 
ColumnarToRowExec for node supporting both columnar and row output
URL: https://github.com/apache/spark/pull/50239



Re: [PR] [SPARK-51474][SQL] Don't insert redundant ColumnarToRowExec for node supporting both columnar and row output [spark]

2025-03-12 Thread via GitHub


viirya commented on PR #50239:
URL: https://github.com/apache/spark/pull/50239#issuecomment-2717957833

   Thank you, @dongjoon-hyun.



Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on code in PR #50236:
URL: https://github.com/apache/spark/pull/50236#discussion_r1991323757


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:
##
@@ -1337,6 +1350,35 @@ case class Cast(
 }
   }
 
+  private[this] def castToTimeCode(
+  from: DataType,
+  ctx: CodegenContext): CastFunction = {
+from match {
+  case _: StringType =>
+val longOpt = ctx.freshVariable("longOpt", classOf[Option[Long]])
+(c, evPrim, evNull) =>
+  if (ansiEnabled) {
+val errorContext = getContextOrNullCode(ctx)
+code"""
+  $evPrim = $dateTimeUtilsCls.stringToTimeAnsi($c, $errorContext);
+"""
+  } else {
+code"""
+  scala.Option $longOpt =
+    org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c);

Review Comment:
   Can this also use `$dateTimeUtilsCls.stringToTime($c)`?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:
##
@@ -1337,6 +1350,35 @@ case class Cast(
 }
   }
 
+  private[this] def castToTimeCode(
+  from: DataType,
+  ctx: CodegenContext): CastFunction = {
+from match {
+  case _: StringType =>
+val longOpt = ctx.freshVariable("longOpt", classOf[Option[Long]])
+(c, evPrim, evNull) =>
+  if (ansiEnabled) {
+val errorContext = getContextOrNullCode(ctx)
+code"""
+  $evPrim = $dateTimeUtilsCls.stringToTimeAnsi($c, $errorContext);
+"""
+  } else {
+code"""
+  scala.Option $longOpt =
+    org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c);
+  if ($longOpt.isDefined()) {
+$evPrim = ((Long) $longOpt.get()).longValue();
+  } else {
+$evNull = true;
+  }
+"""
+  }
+
+  case _ =>
+(_, evPrim, evNull) => code"$evNull = true;"

Review Comment:
   ```suggestion
   (_, _, evNull) => code"$evNull = true;"
   ```




Re: [PR] [SPARK-50005][SQL] Enhance method verifyNotReadPath to identify subqueries hidden in the filter conditions. [spark]

2025-03-12 Thread via GitHub


xunxunmimi5577 commented on PR #48640:
URL: https://github.com/apache/spark/pull/48640#issuecomment-2717698592

   @panbingkun Could you please help review this?



[PR] [WIP][SQL] Support the TIME keyword as a data type [spark]

2025-03-12 Thread via GitHub


MaxGekk opened a new pull request, #50250:
URL: https://github.com/apache/spark/pull/50250

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   By running new tests:
   ```
   $ build/sbt "test:testOnly *DataTypeParserSuite"
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   



Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]

2025-03-12 Thread via GitHub


MaxGekk commented on PR #50236:
URL: https://github.com/apache/spark/pull/50236#issuecomment-2718536341

   Merging to master. Thank you, @LuciferYang and @beliefer, for the review.



Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


zhengruifeng commented on code in PR #50255:
URL: https://github.com/apache/spark/pull/50255#discussion_r1991952614


##
dev/spark-test-image/python-313/Dockerfile:
##
@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
 && rm -rf /var/lib/apt/lists/*
 
 
-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment:
   `dev/infra/Dockerfile` is only used for Spark 3.5; I think we can leave it alone.




Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


zhengruifeng commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2718596875

   > Will `dev/create-release/spark-rm/Dockerfile` be fixed in a separate pull request?
   
   I think we can update/sync `dev/create-release/spark-rm/Dockerfile` just before each release. WDYT?



Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on code in PR #50255:
URL: https://github.com/apache/spark/pull/50255#discussion_r1991916210


##
dev/spark-test-image/python-313/Dockerfile:
##
@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
 && rm -rf /var/lib/apt/lists/*
 
 
-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment:
   https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/spark-test-image/python-313/Dockerfile#L78
   
   Here is another `pyarrow>=18.0.0`.
   
   




Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


simhadri-g commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1991865569


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
   // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
   // and the CommandProcessorFactory.clean function removed.
   driver.getClass.getMethod("close").invoke(driver)
-  if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {

Review Comment:
   Agreed.
   Hive 2.x and 3.x are EoL. We should move completely to Hive 4+.
   
   Thanks!





Re: [PR] [SPARK-51097][SS] Re-introduce RocksDB state store's last uploaded snapshot version instance metrics [spark]

2025-03-12 Thread via GitHub


zecookiez commented on PR #50195:
URL: https://github.com/apache/spark/pull/50195#issuecomment-2718664791

   @HeartSaVioR can you take a quick look at this? Thanks :)



Re: [PR] [WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics [spark]

2025-03-12 Thread via GitHub


zecookiez closed pull request #50157: [WIP] [SPARK-51097] [SS] Split apart 
SparkPlan metrics and instance metrics
URL: https://github.com/apache/spark/pull/50157



Re: [PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on PR #50251:
URL: https://github.com/apache/spark/pull/50251#issuecomment-2718508143

   
![image](https://github.com/user-attachments/assets/c86b11c4-1994-4e66-be06-1481a1419e97)
   All passed.



Re: [PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]

2025-03-12 Thread via GitHub


LuciferYang closed pull request #50251: [MINOR][SQL] Reuse `dateTimeUtilsCls` 
in `Cast`
URL: https://github.com/apache/spark/pull/50251



Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]

2025-03-12 Thread via GitHub


MaxGekk closed pull request #50236: [SPARK-51482][SQL] Support cast from string 
to time
URL: https://github.com/apache/spark/pull/50236



[PR] [SPARK-51489][SQL] Represent SQL Script in Spark UI [spark]

2025-03-12 Thread via GitHub


dusantism-db opened a new pull request, #50256:
URL: https://github.com/apache/spark/pull/50256

   
   
   ### What changes were proposed in this pull request?
   Initial representation of SQL scripts in the Spark UI. This PR introduces a "SQL Script ID" column in the All Executions table of the `SQL / DataFrame` tab in the Spark UI. It also introduces a SQL Script ID label on the single SQL Execution page, which is only present if the SQL execution was executed by a script.
   
   To achieve this, this PR also introduces the concept of a SQL Script ID, which is propagated throughout the script execution. A `Dataset.ofRows` overload which supports script id propagation is introduced.
   
   Screenshots:
   
   All Executions table:
   https://github.com/user-attachments/assets/18f5cca9-5748-47f8-93dc-5d7ef6800e7d
   
   Single Execution page:
   https://github.com/user-attachments/assets/8e8c0f7d-09f8-40ce-bf11-abb28e36aa95
   
   Non-script Single Execution page remains the same:
   https://github.com/user-attachments/assets/123c60ec-a00b-424b-8190-54976b355387
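   
   As a conceptual sketch only (the PR may propagate the ID through a different channel), Spark's local properties are one existing mechanism for tagging everything submitted from a driver thread; the property key below is made up for illustration:
   
   ```scala
   import java.util.UUID
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   
   // Tag all work submitted from this thread with a per-script UUID.
   val scriptId = UUID.randomUUID().toString
   spark.sparkContext.setLocalProperty("example.sql.scriptId", scriptId) // illustrative key
   try {
     // Each statement of the script would run here and inherit the tag.
     spark.sql("SELECT 1").collect()
   } finally {
     spark.sparkContext.setLocalProperty("example.sql.scriptId", null)
   }
   ```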
   
   
   
   ### Why are the changes needed?
   Currently, there is no way to debug / view SQL scripts in the Spark UI.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the Spark UI is updated to include SQL Script data.
   
   
   ### How was this patch tested?
   Existing unit tests, manual UI validation.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   



[PR] [SPARK-51490] Support iOS, watchOS, and tvOS [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun opened a new pull request, #13:
URL: https://github.com/apache/spark-connect-swift/pull/13

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   



Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-12 Thread via GitHub


LuciferYang commented on code in PR #50232:
URL: https://github.com/apache/spark/pull/50232#discussion_r1990820088


##
sql/hive/src/main/java/org/apache/hadoop/hive/ql/exec/HiveFunctionRegistryUtils.java:
##
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.PrimitiveGrouping;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.typeinfo.*;
+
+import java.lang.reflect.Method;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.spark.internal.SparkLogger;
+import org.apache.spark.internal.SparkLoggerFactory;
+
+/**
+ * Copy some methods from {@link 
org.apache.hadoop.hive.ql.exec.FunctionRegistry}
+ * to avoid initializing Hive built-in UDFs.
+ * 
+ * The code is based on Hive 2.3.10.
+ */
+public class HiveFunctionRegistryUtils {
+
+  public static final SparkLogger LOG =
+  SparkLoggerFactory.getLogger(HiveFunctionRegistryUtils.class);

Review Comment:
   nit: indentation






Re: [PR] [SPARK-51483] Add `SparkSession` and `DataFrame` actors [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #10:
URL: 
https://github.com/apache/spark-connect-swift/pull/10#issuecomment-2716686412

   According to the review comment, I didn't mention `4.0.0 RC2` in `README.md` 
yet.
   - https://github.com/apache/spark-connect-swift/pull/4#discussion_r1988302867
   
   For now, this is supposed to support Apache Spark 4.0.0+ only.





Re: [PR] [SPARK-51469][SQL] Improve MapKeyDedupPolicy so that avoid calling toString [spark]

2025-03-12 Thread via GitHub


beliefer commented on PR #50235:
URL: https://github.com/apache/spark/pull/50235#issuecomment-2716796972

   ping @dongjoon-hyun @yaooqinn @LuciferYang 





Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #14:
URL: 
https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719042611

   Thank you, @attilapiros . 
   
   This PR aims to filter the candidate versions when it chooses `Fixed Version`.





Re: [PR] [SPARK-51358] [SS] Introduce snapshot upload lag detection through StateStoreCoordinator [spark]

2025-03-12 Thread via GitHub


zecookiez commented on code in PR #50123:
URL: https://github.com/apache/spark/pull/50123#discussion_r1992241770


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:
##
@@ -38,9 +38,20 @@ import org.apache.spark.sql.types.StructType
 import org.apache.spark.unsafe.Platform
 import org.apache.spark.util.{NonFateSharingCache, Utils}
 
+/**
+ * Trait representing events reported from a RocksDB instance.
+ *
+ * The internal RocksDB instance can use a provider with a 
`RocksDBEventListener` reference to
+ * report specific events like snapshot uploads. This should only be used to 
report back to the
+ * coordinator for metrics and monitoring purposes.
+ */
+trait RocksDBEventListener {
+  def reportSnapshotUploaded(version: Long): Unit
+}
+
 private[sql] class RocksDBStateStoreProvider
   extends StateStoreProvider with Logging with Closeable
-  with SupportsFineGrainedReplay {
+  with SupportsFineGrainedReplay with RocksDBEventListener {

Review Comment:
   I'm not fully sure how we can do this with a new class/object. The problem
   that the trait definition solves is that only the provider has the state
   store ID and query ID saved. A separate event listener object wouldn't be
   able to access these unless we pass them into the event listener object as
   well.
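   
   For illustration, here is a minimal, self-contained sketch of that
   alternative, where the IDs are handed over at construction time. The class
   and field names below are hypothetical stand-ins, not the actual Spark types:
   
   ```
   // Hypothetical stand-in; the real provider keeps these IDs internally.
   case class StateStoreProviderId(storeId: String, queryRunId: String)
   
   trait RocksDBEventListener {
     def reportSnapshotUploaded(version: Long): Unit
   }
   
   // A standalone listener needs the provider's IDs passed in explicitly.
   class CoordinatorReportingListener(providerId: StateStoreProviderId)
       extends RocksDBEventListener {
     override def reportSnapshotUploaded(version: Long): Unit = {
       // In the real code this would report back to the StateStoreCoordinator.
       println(s"Snapshot v$version uploaded by $providerId")
     }
   }
   ```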






Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


attilapiros commented on PR #14:
URL: 
https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718979814

   Let me do a quick test..





Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2719073678

   I switched the issue type from `Improvement` to `Test`. Feel free to switch 
back if it looks improper to you, @zhengruifeng .
   
   Code-wise, I'm +1 if this doesn't cause any outage in `master` and
   `branch-4.0` (commit and Daily builder).





Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #14:
URL: 
https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719045920

   To test, you can try to merge this PR (SPARK-51493) with this script.
   Screenshot: https://github.com/user-attachments/assets/eed46cc9-cbea-44ed-909a-10d640e3a678
   





Re: [PR] [SPARK-51473][ML][CONNECT] ML transformed dataframe keep a reference to the model [spark]

2025-03-12 Thread via GitHub


zhengruifeng commented on code in PR #50199:
URL: https://github.com/apache/spark/pull/50199#discussion_r1992274916


##
python/pyspark/ml/util.py:
##
@@ -185,29 +185,40 @@ def wrapped(self: "JavaWrapper", dataset: 
"ConnectDataFrame") -> Any:
 
 assert isinstance(self._java_obj, str)
 params = serialize_ml_params(self, session.client)
-return ConnectDataFrame(
-TransformerRelation(
-child=dataset._plan, name=self._java_obj, 
ml_params=params, is_model=True
-),
-session,
+plan = TransformerRelation(
+child=dataset._plan,
+name=self._java_obj,
+ml_params=params,
+is_model=True,
 )
 elif isinstance(self, Transformer):
 from pyspark.ml.connect.proto import TransformerRelation
 
 assert isinstance(self._java_obj, str)
 params = serialize_ml_params(self, session.client)
-return ConnectDataFrame(
-TransformerRelation(
-child=dataset._plan,
-name=self._java_obj,
-ml_params=params,
-uid=self.uid,
-is_model=False,
-),
-session,
+plan = TransformerRelation(
+child=dataset._plan,
+name=self._java_obj,
+ml_params=params,
+uid=self.uid,
+is_model=False,
 )
+
 else:
 raise RuntimeError(f"Unsupported {self}")
+
+# To delay the GC of the model, keep a reference to the source 
transformer
+# in the transformed dataframe and all its descendants.
+# For this case:
+#
+# def fit_transform(df):
+# model = estimator.fit(df)
+# return model.transform(df)
+#
+# output = fit_transform(df)
+#
+plan.__source_transformer__ = self  # type: ignore[attr-defined]

Review Comment:
   we should






Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


attilapiros commented on PR #50122:
URL: https://github.com/apache/spark/pull/50122#issuecomment-2719130827

   Backport to branch-4.0: https://github.com/apache/spark/pull/50259





Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


attilapiros commented on PR #50122:
URL: https://github.com/apache/spark/pull/50122#issuecomment-2719136015

   Backport to branch-3.5: https://github.com/apache/spark/pull/50260





[PR] [SPARK-43221][CORE][4.0] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


attilapiros opened a new pull request, #50259:
URL: https://github.com/apache/spark/pull/50259

   **This is a backport to branch-4.0 from master.**
   
   Thanks to @yorksity, who reported this error and even provided a PR for it.
   This solution is very different from https://github.com/apache/spark/pull/40883,
   as `BlockManagerMasterEndpoint#getLocationsAndStatus()` needed some refactoring.
   
   ### What changes were proposed in this pull request?
   
   This PR fixes an error that can manifest as the following exception:
   
   ```
   25/02/20 09:58:31 ERROR util.Utils: [Executor task launch worker for task 61.0 in stage 67.0 (TID 9391)]: Exception encountered
   java.lang.ArrayIndexOutOfBoundsException: 0
     at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:185) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
     at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) ~[scala-library-2.12.15.jar:?]
     at scala.collection.immutable.List.foreach(List.scala:431) ~[scala-library-2.12.15.jar:?]
     at org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:171) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
   ```
   
   The PR is changing `BlockManagerMasterEndpoint#getLocationsAndStatus()`.
   
   The `BlockManagerMasterEndpoint#getLocationsAndStatus()` function returns an
   optional `BlockLocationsAndStatus`, which consists of 3 parts:
    - `locations`: all the locations where the block can be found (as a sequence
      of block manager IDs)
    - `status`: one block status
    - `localDirs`: optional directory paths which can be used to read the block
      if it is found on the disk of an executor running on the same host

   A block (an RDD, shuffle, or torrent block) can be stored on many executors
   with different storage levels: disk or memory.
   
   This PR changes how the block status and the block manager ID for `localDirs`
   are found, to guarantee that they belong together.
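   
   To make the pairing concrete, here is a small, self-contained sketch of the
   idea. The types below are simplified stand-ins for Spark's internal classes,
   not the actual implementation:
   
   ```
   case class BlockManagerId(host: String)
   case class BlockStatus(useDisk: Boolean, diskSize: Long)
   
   // Find the status and the block manager ID together, so that localDirs is
   // always derived from the same replica the returned status describes.
   def hostLocalStatus(
       locations: Seq[BlockManagerId],
       statusOf: BlockManagerId => Option[BlockStatus],
       requesterHost: String): Option[(BlockStatus, BlockManagerId)] =
     locations.iterator
       .filter(_.host == requesterHost)
       .flatMap(bmId => statusOf(bmId).map(status => (status, bmId)))
       .find(_._1.useDisk) // prefer a replica that is actually on disk
   ```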
   
   ### Why are the changes needed?
   
   Before this PR, `BlockManagerMasterEndpoint#getLocationsAndStatus()` searched
   for the block status (`status`) and the `localDirs` separately. The block
   status was computed from the very first location where the block could be
   found. As a result, the returned status could easily represent an in-memory
   block (whose disk size is 0, since it is stored in memory) while `localDirs`
   was filled out based on a host-local block instance stored on disk.
   
   This situation can be very frequent, but it only causes problems (exceptions
   like the one above) when encryption is on (`spark.io.encryption.enabled=true`),
   because for an unencrypted block the whole file containing the block is read; see
   https://github.com/apache/spark/blob/branch-3.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1244
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Host-local block fetching was already covered by some existing unit tests,
   but a new unit test is provided for this exact case: "SPARK-43221: Host local
   block fetching should use a block status with disk size".
   
   The number of block managers and the order of the blocks were chosen after
   some experimentation, as the block status order depends on a `HashSet`; see:
   ```
  private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
   ```
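   
   As a quick illustration of why the ordering needed experimentation: the
   iteration order of a `mutable.HashSet` is unspecified and need not match the
   insertion order, for example:
   
   ```
   import scala.collection.mutable
   
   val ids = mutable.HashSet("bm-1", "bm-2", "bm-3", "bm-10")
   // The printed order is an implementation detail of the hash table, so a
   // test that depends on "the first" location must be constructed carefully.
   println(ids.toSeq)
   ```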
   
   This test was also executed with the old code to validate that it reproduces
   the issue:
   ```
   BlockManagerSuite:
   OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
   - SPARK-43221: Host local block fetching should use a block status with disk size *** FAILED ***
     0 was not greater than 0 The block size must be greater than 0 for a nonempty block! (BlockManagerSuite.scala:491)
   Run completed in 6 seconds, 705 milliseconds.
   Total number of tests run: 1
   Suites: completed 1, aborted 0
   Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
   *** 1 TEST FAILED ***
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   
   (cherry picked from commit 997e599738ebc3b1e1df6a313cd222bb70ef481d)
   





Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #14:
URL: 
https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719150536

   Thank you so much, @attilapiros !





Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


simhadri-g commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1990961751


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
   // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type 
changed
   // and the CommandProcessorFactory.clean function removed.
   driver.getClass.getMethod("close").invoke(driver)
-  if (version != hive.v3_0 && version != hive.v3_1 && version != 
hive.v4_0) {

Review Comment:
   This would break backward compatibility when Spark tries to connect with
   older versions of Hive, right?






Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


Ngone51 commented on code in PR #50122:
URL: https://github.com/apache/spark/pull/50122#discussion_r1990890014


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -862,31 +862,50 @@ class BlockManagerMasterEndpoint(
   private def getLocationsAndStatus(
   blockId: BlockId,
   requesterHost: String): Option[BlockLocationsAndStatus] = {
-val locations = 
Option(blockLocations.get(blockId)).map(_.toSeq).getOrElse(Seq.empty)
-val status = locations.headOption.flatMap { bmId =>
-  if (externalShuffleServiceRddFetchEnabled && bmId.port == 
externalShuffleServicePort) {
-blockStatusByShuffleService.get(bmId).flatMap(m => m.get(blockId))
-  } else {
-blockManagerInfo.get(bmId).flatMap(_.getStatus(blockId))
+val allLocations = 
Option(blockLocations.get(blockId)).map(_.toSeq).getOrElse(Seq.empty)
+val hostLocalLocations = allLocations.filter(bmId => bmId.host == 
requesterHost)
+
+val blockStatusWithBlockManagerId: Option[(BlockStatus, BlockManagerId)] =
+  (if (externalShuffleServiceRddFetchEnabled) {
+ // if fetching RDD is enabled from the external shuffle service then 
first try to find
+ // the block in the external shuffle service of the same host
+ val location = hostLocalLocations.find(_.port == 
externalShuffleServicePort)
+ location
+   .flatMap(blockStatusByShuffleService.get(_).flatMap(_.get(blockId)))
+   .zip(location)
+   } else {
+ None
+   })
+.orElse {
+  // if the block is not found via the external shuffle service trying 
to find it in the
+  // executors running on the same host and persisted on the disk
+  // using flatMap on iterators makes the transformation lazy
+  hostLocalLocations.iterator
+.flatMap { bmId =>
+  blockManagerInfo.get(bmId).flatMap { blockInfo =>
+blockInfo.getStatus(blockId).map((_, bmId))
+  }
+}
+.find(_._1.storageLevel.useDisk)

Review Comment:
   Do we now enforce disk first over memory after this change? Looks like a
   behaviour change.






Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-12 Thread via GitHub


pan3793 commented on code in PR #50232:
URL: https://github.com/apache/spark/pull/50232#discussion_r1990737841


##
sql/hive-thriftserver/pom.xml:
##
@@ -148,16 +148,6 @@
   byte-buddy-agent
   test
 
-

Review Comment:
   Done by updating `SparkBuild.scala`; it can be verified with:
   ```
   build/sbt -Phive-thriftserver hive-thriftserver/Test/dependencyTree | grep hive-llap
   ```
   
   before
   ```
   [info]   +-org.apache.hive:hive-llap-client:2.3.10
   [info]   +-org.apache.hive:hive-llap-common:2.3.10
   [info]   | +-org.apache.hive:hive-llap-client:2.3.10
   [info]   | +-org.apache.hive:hive-llap-common:2.3.10
   ```
   
   now result is empty.






Re: [PR] [SPARK-51365][TESTS] Test maven + macos [spark]

2025-03-12 Thread via GitHub


LuciferYang closed pull request #50178: [SPARK-51365][TESTS] Test maven + macos
URL: https://github.com/apache/spark/pull/50178





Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]

2025-03-12 Thread via GitHub


beliefer commented on code in PR #50245:
URL: https://github.com/apache/spark/pull/50245#discussion_r1991031211


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with 
Logging {
 }
 
 if (resKey.length == 1 && resValue.length == 1) {
-  val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-  val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-  val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-  val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-  Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+  // If the key and value expressions have not changed, we just check 
original map field.
+  // Otherwise, we construct a new map by adding transformations to the 
keys and values.
+  if (resKey.head == keyParam && resValue.head == valueParam) {
+Some(
+  Alias(nullCheckedInput, expected.name)(
+nonInheritableMetadataKeys =
+  Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))

Review Comment:
   @viirya Could you help me understand why remove the 
`CHAR_VARCHAR_TYPE_STRING_METADATA_KEY` from metadata?






Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]

2025-03-12 Thread via GitHub


beliefer commented on code in PR #50245:
URL: https://github.com/apache/spark/pull/50245#discussion_r1990965674


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with 
Logging {
 }
 
 if (resKey.length == 1 && resValue.length == 1) {
-  val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-  val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-  val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-  val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-  Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+  // If the key and value expressions have not changed, we just check 
original map field.
+  // Otherwise, we construct a new map by adding transformations to the 
keys and values.
+  if (resKey.head == keyParam && resValue.head == valueParam) {
+Some(
+  Alias(nullCheckedInput, expected.name)(
+nonInheritableMetadataKeys =
+  Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))

Review Comment:
   Could you explain why you want to remove the
   `CHAR_VARCHAR_TYPE_STRING_METADATA_KEY` from the metadata?






Re: [PR] [SPARK-49485][CORE] Fix speculative task hang bug due to remaining executor lay on same host [spark]

2025-03-12 Thread via GitHub


buska88 commented on code in PR #47949:
URL: https://github.com/apache/spark/pull/47949#discussion_r1990965346


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -308,13 +328,59 @@ private[spark] class ExecutorAllocationManager(
   tasksPerExecutor).toInt
 
 val maxNeededWithSpeculationLocalityOffset =
-  if (tasksPerExecutor > 1 && maxNeeded == 1 && pendingSpeculative > 0) {
-  // If we have pending speculative tasks and only need a single executor, 
allocate one more
-  // to satisfy the locality requirements of speculation
-  maxNeeded + 1
-} else {
-  maxNeeded
-}
+  if (pendingSpeculative > 0 && maxNeeded <= numExecutorsTarget &&
+executorMonitor.executorHostsCount == 1 && 
scheduler.isInstanceOf[TaskSchedulerImpl]) {
+val now = clock.nanoTime()
+if (excludeNodesTriggerTime == NOT_SET) {
+  excludeNodesTriggerTime = now + 
TimeUnit.MINUTES.toNanos(excludeNodeTriggerTimeoutMin)
+  logDebug(log"Current timestamp ${MDC(TIMESTAMP, now)}, " +
+log"set excludeNodesTriggerTime to ${MDC(TIMESTAMP, 
excludeNodesTriggerTime)}")
+  maxNeeded
+} else if (now < excludeNodesTriggerTime) {
+  maxNeeded
+} else {
+  if (executorMonitor.hasAdjustMaxNeed) {
+adjustMaxNeed
+  } else {
+logDebug(log"${MDC(TIMESTAMP, now)} exceeds" +
+  log" ${MDC(TIMESTAMP, excludeNodesTriggerTime)}, start exclude 
node!")
+val node = executorMonitor.getExecutorHostsName(0)
+// check if current remaining host has attempts of speculative task
+val speculativeTasksInfo = 
listener.getPendingSpeculativeTasksInfo()
+if (scheduler.asInstanceOf[TaskSchedulerImpl].
+  speculativeTasksHasAttemptOnHost(node, speculativeTasksInfo)) {
+  // make sure new maxNeed exceeds numExecutorsTarget and allocate 
executor
+  adjustMaxNeed = numExecutorsTarget + 1
+  // If hasAdjustMaxNeed, use last adjust value as
+  // maxNeededWithSpeculationLocalityOffset in case of 
numExecutorsTarget keeps
+  // increasing during maxNumExecutorsNeededPerResourceProfile 
method called
+  if 
(scheduler.asInstanceOf[TaskSchedulerImpl].handleExcludeNodes(node)) {
+logDebug(log"Exclude ${MDC(HOST, node)} for speculative, " +
+  log"old maxNeeded: ${MDC(COUNT, maxNeeded)}, " +
+  log"old numExecutorsTarget: ${MDC(COUNT, 
numExecutorsTarget)}, " +
+  log"current executors count: ${MDC(COUNT, 
executorMonitor.executorCount)}")
+excludeSpeculativeNodes.add(node)
+executorMonitor.setAdjustMaxNeed(true)
+adjustMaxNeed
+  } else {
+logDebug(log"${MDC(HOST, node)} has been excluded for other 
reason")
+maxNeeded
+  }
+} else {
+  logDebug(log"No speculative task found on ${MDC(HOST, node)}")
+  maxNeeded
+}
+  }
+}
+  } else if (tasksPerExecutor > 1 && maxNeeded == 1 && pendingSpeculative 
> 0) {
+resetExcludeNodesTriggerTime()
+// If we have pending speculative tasks and only need a single 
executor, allocate one more
+// to satisfy the locality requirements of speculation
+maxNeeded + 1
+  } else {
+resetExcludeNodesTriggerTime()
+maxNeeded
+  }

Review Comment:
   Thanks for reviewing. I noticed the review message today. What if we make
   this feature configurable?
   In some cases, users are more intolerant of being stuck than of wasted
   resources.
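   
   For illustration, such a flag could be declared with Spark's internal
   `ConfigBuilder` API (it would have to live inside Spark's own
   `org.apache.spark.internal.config` package); the key name, doc text, default,
   and version below are hypothetical placeholders:
   
   ```
   import org.apache.spark.internal.config.ConfigBuilder
   
   object SpeculationExclusionConf {
     // Hypothetical config key; every detail here is a placeholder.
     val EXCLUDE_SPECULATIVE_SINGLE_HOST_ENABLED =
       ConfigBuilder("spark.dynamicAllocation.excludeSpeculativeSingleHost.enabled")
         .doc("When true, allow excluding the last remaining host so that a " +
           "pending speculative task cannot hang the application.")
         .version("4.1.0")
         .booleanConf
         .createWithDefault(false)
   }
   ```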






Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]

2025-03-12 Thread via GitHub


wForget commented on code in PR #50245:
URL: https://github.com/apache/spark/pull/50245#discussion_r1990969615


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with 
Logging {
 }
 
 if (resKey.length == 1 && resValue.length == 1) {
-  val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-  val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-  val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-  val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-  Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+  // If the key and value expressions have not changed, we just check 
original map field.
+  // Otherwise, we construct a new map by adding transformations to the 
keys and values.
+  if (resKey.head == keyParam && resValue.head == valueParam) {
+Some(
+  Alias(nullCheckedInput, expected.name)(
+nonInheritableMetadataKeys =
+  Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))

Review Comment:
   I just want to be consistent with #47843






[PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]

2025-03-12 Thread via GitHub


LuciferYang opened a new pull request, #50249:
URL: https://github.com/apache/spark/pull/50249

   ### What changes were proposed in this pull request?
   This PR aims to remove the unused function `private def newDFSFileName(String)`
   from `RocksDBFileManager`; it is no longer used after SPARK-49770
   (https://github.com/apache/spark/pull/47875).
   
   
   
   ### Why are the changes needed?
   Code cleanup.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]

2025-03-12 Thread via GitHub


MaxGekk commented on PR #50236:
URL: https://github.com/apache/spark/pull/50236#issuecomment-2717411710

   @yaooqinn @LuciferYang @dongjoon-hyun @HyukjinKwon Could you take a look at 
the PR when you have time, please.





Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]

2025-03-12 Thread via GitHub


MaxGekk commented on code in PR #50236:
URL: https://github.com/apache/spark/pull/50236#discussion_r1991652407


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:
##
@@ -1337,6 +1350,35 @@ case class Cast(
 }
   }
 
+  private[this] def castToTimeCode(
+  from: DataType,
+  ctx: CodegenContext): CastFunction = {
+from match {
+  case _: StringType =>
+val longOpt = ctx.freshVariable("longOpt", classOf[Option[Long]])
+(c, evPrim, evNull) =>
+  if (ansiEnabled) {
+val errorContext = getContextOrNullCode(ctx)
+code"""
+  $evPrim = $dateTimeUtilsCls.stringToTimeAnsi($c, $errorContext);
+"""
+  } else {
+code"""
+  scala.Option $longOpt =
+
org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c);

Review Comment:
   Fixed, and changed other places too. Please, review the PR: 
https://github.com/apache/spark/pull/50251






Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]

2025-03-12 Thread via GitHub


HyukjinKwon commented on PR #50254:
URL: https://github.com/apache/spark/pull/50254#issuecomment-2718324059

   cc @LuciferYang 





Re: [PR] [SPARK-51338][INFRA] Add automated CI build for `connect-examples` [spark]

2025-03-12 Thread via GitHub


vicennial commented on code in PR #50187:
URL: https://github.com/apache/spark/pull/50187#discussion_r1991794909


##
connect-examples/server-library-example/pom.xml:
##
@@ -36,7 +36,8 @@
 UTF-8
 2.13
 2.13.15
-3.25.4
-4.0.0-preview2
+4.29.3
+4.1.0-SNAPSHOT
+33.4.0-jre

Review Comment:
   What if we update the release script and add a rule/command to auto-change
   the project version in this pom file as well? This way, we can both keep
   continuous-build compatibility with Spark and stay somewhat independent
   (modulo the dependency on the ASF snapshot repo).
   
   I'd like to avoid inheriting the parent pom, as that would lead to the
   project pulling in Spark's default shading rules, version definitions, etc.
   In this specific case, that wouldn't be favourable, since the example is
   intended to demonstrate the extension's development using a minimal set of
   dependencies (spark-sql-api, spark-connect-client, etc.).









[PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]

2025-03-12 Thread via GitHub


HyukjinKwon opened a new pull request, #50254:
URL: https://github.com/apache/spark/pull/50254

   ### What changes were proposed in this pull request?
   
   This PR proposes to generate with a newline at the end of 
`error-conditions.json`.
   
   ### Why are the changes needed?
   
   To make the styles same as other Python files.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing tests in CI.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [MINOR][DOC] Improve the documentation for configuration. [spark]

2025-03-12 Thread via GitHub


0xbadidea closed pull request #50247: [MINOR][DOC] Improve the documentation 
for configuration.
URL: https://github.com/apache/spark/pull/50247





Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


deniskuzZ commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1991644845


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:
##
@@ -1210,7 +1210,7 @@ class HiveQuerySuite extends HiveComparisonTest with 
SQLTestUtils with BeforeAnd
 .zip(parts)
 .map { case (k, v) =>
   if (v == "NULL") {
-s"$k=${ConfVars.DEFAULTPARTITIONNAME.defaultStrVal}"
+s"$k=${ConfVars.DEFAULTPARTITIONNAME.getDefaultVal}"

Review Comment:
   you should use `org.apache.hadoop.hive.conf.HiveConf.DEFAULT_PARTITION_NAME`






Re: [PR] [SPARK-50188][CONNECT][PYTHON] When the connect client starts, print the server's webUrl [spark]

2025-03-12 Thread via GitHub


panbingkun commented on PR #48720:
URL: https://github.com/apache/spark/pull/48720#issuecomment-2716707617

   > +1 for the idea. Could you rebase this once more, @panbingkun ?
   
   Sorry, I have been tied up with a new job recently,
   but the PR has been updated now!
   Thanks.





[PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]

2025-03-12 Thread via GitHub


MaxGekk opened a new pull request, #50251:
URL: https://github.com/apache/spark/pull/50251

   ### What changes were proposed in this pull request?
   In the PR, I propose to re-use the method `dateTimeUtilsCls()` instead of 
`org.apache.spark.sql.catalyst.util.DateTimeUtils` in the `Cast` expression.
   
   ### Why are the changes needed?
   1. To make code consistent: in some places we use `dateTimeUtilsCls()`, but 
in others `org.apache.spark.sql.catalyst.util.DateTimeUtils`.
   2. To improve code maintenance:  `DateTimeUtils` can be moved to another 
package w/o changing the code generation in `Cast`.
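   
   As a rough sketch of the difference in the generated-code templates (the
   constant below is a stand-in for the real `dateTimeUtilsCls()` helper, not
   the actual `Cast` internals):
   
   ```
   object CodegenSketch {
     // Stand-in for Cast.dateTimeUtilsCls().
     val dateTimeUtilsCls: String = "org.apache.spark.sql.catalyst.util.DateTimeUtils"
   
     // Before: each codegen template hard-codes the fully qualified name.
     def before(c: String): String =
       s"org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c)"
   
     // After: templates share one reference, so DateTimeUtils can move to
     // another package without changing the code generation in Cast.
     def after(c: String): String = s"$dateTimeUtilsCls.stringToTime($c)"
   }
   ```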
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   By existing GAs.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





[PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun opened a new pull request, #11:
URL: https://github.com/apache/spark-connect-swift/pull/11

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-12 Thread via GitHub


aokolnychyi opened a new pull request, #50253:
URL: https://github.com/apache/spark/pull/50253

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds DSv2 APIs for constraints as per SPIP 
[doc](https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo/).
   
   ### Why are the changes needed?
   
   
   These changes are the first step for constraints support in Spark. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No, just interfaces.
   
   ### How was this patch tested?
   
   
   This PR comes with tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No.
   





Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]

2025-03-12 Thread via GitHub


HyukjinKwon commented on PR #50241:
URL: https://github.com/apache/spark/pull/50241#issuecomment-2718308691

   Merged to master.





Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]

2025-03-12 Thread via GitHub


HyukjinKwon commented on code in PR #50241:
URL: https://github.com/apache/spark/pull/50241#discussion_r1991779317


##
python/pyspark/errors/error-conditions.json:
##
@@ -1208,4 +1208,4 @@
   "Index must be non-zero."
 ]
   }
-}
+}

Review Comment:
   it is a generated file. Maybe we should fix the generation. I will make a PR






Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]

2025-03-12 Thread via GitHub


HyukjinKwon closed pull request #50241: [MINOR][PYTHON] Reformat error classes
URL: https://github.com/apache/spark/pull/50241





[PR] [SPARK-51271][PYTHON] Filter serialization for Python Data Source filter pushdown [spark]

2025-03-12 Thread via GitHub


wengh opened a new pull request, #50252:
URL: https://github.com/apache/spark/pull/50252

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-51476][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11) [spark]

2025-03-12 Thread via GitHub


HyukjinKwon closed pull request #50242: [SPARK-51476][PYTHON][DOCS] Update 
document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 
3.11)
URL: https://github.com/apache/spark/pull/50242








Re: [PR] [SPARK-51467][UI] Make tables of the environment page filterable [spark]

2025-03-12 Thread via GitHub


yaooqinn commented on PR #50233:
URL: https://github.com/apache/spark/pull/50233#issuecomment-2716542978

   > The overloaded action issues might be considered later in an independent 
JIRA (if needed later).
   
   Thank you @dongjoon-hyun.
   
   For potential UX improvements, I'd leave it for further decisions based on 
user feedback.
   
   Merged to master.
   





Re: [PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #11:
URL: 
https://github.com/apache/spark-connect-swift/pull/11#issuecomment-2718378823

   Thank you, @viirya .
   Merged to main.





Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]

2025-03-12 Thread via GitHub


vrozov commented on code in PR #50213:
URL: https://github.com/apache/spark/pull/50213#discussion_r1991823559


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
   // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type 
changed
   // and the CommandProcessorFactory.clean function removed.
   driver.getClass.getMethod("close").invoke(driver)
-  if (version != hive.v3_0 && version != hive.v3_1 && version != 
hive.v4_0) {

Review Comment:
   I don't think it is a connection problem. There will be a problem when 2.x
   Hive jars are provided at runtime. It is not clear to me whether it is still
   necessary to support such an option, given that 2.x is EOL.






Re: [PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun closed pull request #11: [SPARK-51485] Add `How to use in your 
apps` section to `README.md`
URL: https://github.com/apache/spark-connect-swift/pull/11





Re: [PR] [SPARK-51476][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11) [spark]

2025-03-12 Thread via GitHub


HyukjinKwon commented on PR #50242:
URL: https://github.com/apache/spark/pull/50242#issuecomment-2718310865

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]

2025-03-12 Thread via GitHub


dongjoon-hyun commented on PR #14:
URL: 
https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719154602

   Merged to main~ (I also used this new script to merge this PR.)
   
   ```
   ...
   Would you like to update an associated JIRA? (y/N): y
   Enter a JIRA id [SPARK-51493]:
   === JIRA SPARK-51493 ===
   Summary      Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version
   Assignee     None
   Status       Open
   Url          https://issues.apache.org/jira/browse/SPARK-51493
   Affected     ['connect-swift-0.1.0']
   
   Check if the JIRA information is as expected (y/N): y
   JIRA is unassigned, choose assignee
   [0] Dongjoon Hyun (Reporter)
   Enter number of user, or userid, to assign to (blank to leave unassigned): 0
   Enter comma-separated fix version(s) [connect-swift-0.1.0]:
   === JIRA SPARK-51493 ===
   Summary      Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version
   Assignee     Dongjoon Hyun
   Status       Resolved
   Url          https://issues.apache.org/jira/browse/SPARK-51493
   Affected     ['connect-swift-0.1.0']
   Fixed        ['connect-swift-0.1.0']
   
   Successfully resolved SPARK-51493 with fixVersions=['connect-swift-0.1.0']!
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51358] [SS] Introduce snapshot upload lag detection through StateStoreCoordinator [spark]

2025-03-12 Thread via GitHub


zecookiez commented on code in PR #50123:
URL: https://github.com/apache/spark/pull/50123#discussion_r1992391021


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala:
##
@@ -65,6 +65,7 @@ case object StoreTaskCompletionListener extends RocksDBOpType("store_task_comple
  * @param localRootDir Root directory in local disk that is used to working and checkpointing dirs
  * @param hadoopConf   Hadoop configuration for talking to the remote file system
  * @param loggingId    Id that will be prepended in logs for isolating concurrent RocksDBs
+ * @param providerListener A reference to the state store provider for event callback reporting

Review Comment:
   Hmm okay, will change that part to "The parent RocksDBStateStoreProvider 
object" to help make it clearer :+1:



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:
##
@@ -38,9 +38,20 @@ import org.apache.spark.sql.types.StructType
 import org.apache.spark.unsafe.Platform
 import org.apache.spark.util.{NonFateSharingCache, Utils}
 
+/**
+ * Trait representing events reported from a RocksDB instance.
+ *
+ * The internal RocksDB instance can use a provider with a `RocksDBEventListener` reference to

Review Comment:
   Good idea, that makes more sense than what I had. Thanks!
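   
   For readers following the thread, a minimal sketch of the callback wiring 
that the doc comment above describes; only the `RocksDBEventListener` name comes 
from the diff, while the method and wiring below are illustrative assumptions 
rather than the actual Spark implementation:
   
   ```scala
   // Sketch only: the trait lets the internal RocksDB instance report events
   // back to its parent RocksDBStateStoreProvider without a hard dependency.
   trait RocksDBEventListener {
     // Hypothetical callback, invoked when a snapshot upload finishes.
     def reportSnapshotUploaded(version: Long): Unit
   }
   
   // The provider (or a thin adapter around it) is handed to the RocksDB instance.
   class RocksDB(listener: Option[RocksDBEventListener]) {
     def uploadSnapshot(version: Long): Unit = {
       // ... upload checkpoint files to DFS ...
       listener.foreach(_.reportSnapshotUploaded(version))
     }
   }
   ```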



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-43221][CORE][3.5] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-12 Thread via GitHub


attilapiros opened a new pull request, #50260:
URL: https://github.com/apache/spark/pull/50260

   **This is a backport to branch-3.5 from master.**
   
   Thanks to @yorksity, who reported this error and even provided a PR for it. 
   This solution is very different from https://github.com/apache/spark/pull/40883, 
as `BlockManagerMasterEndpoint#getLocationsAndStatus()` needed some refactoring.
   
   ### What changes were proposed in this pull request?
   
   This PR fixes an error which can be manifested in the following exception:
   
   ```
   25/02/20 09:58:31 ERROR util.Utils: [Executor task launch worker for task 61.0 in stage 67.0 (TID 9391)]: Exception encountered
   java.lang.ArrayIndexOutOfBoundsException: 0
     at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:185) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
     at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) ~[scala-library-2.12.15.jar:?]
     at scala.collection.immutable.List.foreach(List.scala:431) ~[scala-library-2.12.15.jar:?]
     at org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:171) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
   ```
   
   This PR changes `BlockManagerMasterEndpoint#getLocationsAndStatus()`.
   
   The `BlockManagerMasterEndpoint#getLocationsAndStatus()` function returns an 
optional `BlockLocationsAndStatus`, which consists of 3 parts (see the sketch 
after this list):
    - `locations`: all the locations where the block can be found (as a 
sequence of block manager IDs)
    - `status`: one block status
    - `localDirs`: optional directory paths which can be used to read the block 
if it is found on the disk of an executor running on the same host
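   
   A minimal sketch of that shape, with field types assumed from the description 
above rather than copied from the Spark source:
   
   ```scala
   import org.apache.spark.storage.{BlockManagerId, BlockStatus}
   
   // Sketch only: the optional result described by the three bullets above.
   case class BlockLocationsAndStatus(
       locations: Seq[BlockManagerId],   // every block manager holding the block
       status: BlockStatus,              // one block status
       localDirs: Option[Array[String]]) // set only for a same-host on-disk copy
   ```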

   A block (whether an RDD block, a shuffle block, or a torrent block) can be 
stored on many executors with different storage levels: disk or memory.
   
   This PR changes how the block status and the block manager ID for the 
`localDirs` are found, to guarantee they belong together.
   
   ### Why are the changes needed?
   
   Before this PR, `BlockManagerMasterEndpoint#getLocationsAndStatus()` searched 
for the block status (`status`) and the `localDirs` separately. The block status 
was actually computed from the very first location where the block could be 
found. As a result, the block status could easily represent an in-memory block 
(where the disk size is 0, since the block is stored in memory) while the 
`localDirs` was filled out based on a host-local block instance that was stored 
on disk (a sketch of the fixed pairing follows).
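   
   A minimal sketch of the fixed pairing, assuming per-block-manager lookups for 
status and local dirs (the helper names are illustrative, not the actual Spark 
code):
   
   ```scala
   import org.apache.spark.storage.{BlockManagerId, BlockStatus}
   
   // Sketch only: choose status and localDirs from the SAME block manager, so a
   // disk-stored host-local copy is never paired with an in-memory status.
   def pickStatusAndLocalDirs(
       locations: Seq[BlockManagerId],
       statusOf: BlockManagerId => Option[BlockStatus],
       localDirsOf: BlockManagerId => Option[Array[String]],
       requesterHost: String): Option[(BlockStatus, Option[Array[String]])] = {
     locations.find(id => id.host == requesterHost && localDirsOf(id).isDefined) match {
       case Some(hostLocal) =>
         // Status comes from the very executor whose local dirs are returned,
         // so its diskSize reflects the on-disk copy the reader will open.
         statusOf(hostLocal).map(s => (s, localDirsOf(hostLocal)))
       case None =>
         // No host-local on-disk copy: fall back to the first available status.
         locations.flatMap(id => statusOf(id)).headOption.map(s => (s, None))
     }
   }
   ```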
   
   This situation can occur quite frequently, but it only causes problems 
(exceptions like the one above) when encryption is on 
(`spark.io.encryption.enabled=true`), because for an unencrypted block the whole 
file containing the block is read; see
   
https://github.com/apache/spark/blob/branch-3.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1244
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Host-local block fetching was already covered by some existing unit tests, 
but a new unit test is provided for this exact case: "SPARK-43221: Host local 
block fetching should use a block status with disk size".
   
   The number of block managers and the order of the blocks were chosen after 
some experimentation, as the block status order depends on a `HashSet`, see:
   ```
   private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
   ```
   
   This test was also executed with the old code to validate that the issue is 
reproduced:
   ```
   BlockManagerSuite:
   OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
   - SPARK-43221: Host local block fetching should use a block status with disk size *** FAILED ***
     0 was not greater than 0 The block size must be greater than 0 for a nonempty block! (BlockManagerSuite.scala:491)
   Run completed in 6 seconds, 705 milliseconds.
   Total number of tests run: 1
   Suites: completed 1, aborted 0
   Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
   *** 1 TEST FAILED ***
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   
   (cherry picked from commit 997e599738ebc3b1e1df6a313cd222bb70ef481d)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


