Re: [PR] [SPARK-51490] Support `iOS`, `watchOS`, and `tvOS` [spark-connect-swift]
dongjoon-hyun commented on PR #13: URL: https://github.com/apache/spark-connect-swift/pull/13#issuecomment-2718747123 Thank you so much, @huaxingao ! Merged to main
Re: [PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]
dongjoon-hyun closed pull request #50249: [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` URL: https://github.com/apache/spark/pull/50249
Re: [PR] [SPARK-51488][SQL] Support the TIME keyword as a data type [spark]
dongjoon-hyun commented on PR #50250: URL: https://github.com/apache/spark/pull/50250#issuecomment-2718780657 Merged to master. Thank you, @MaxGekk .
Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]
dongjoon-hyun closed pull request #50232: [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation URL: https://github.com/apache/spark/pull/50232
Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]
dongjoon-hyun commented on PR #50232: URL: https://github.com/apache/spark/pull/50232#issuecomment-2718785462 Thank you, @pan3793 and @LuciferYang . Merged to master. Could you make a backporting PR to branch-4.0 to pass CI there once more, @pan3793 ?
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
LuciferYang commented on PR #50255: URL: https://github.com/apache/spark/pull/50255#issuecomment-2718534144 Will `dev/create-release/spark-rm/Dockerfile` be fixed in a separate pull request?
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
dongjoon-hyun closed pull request #50122: [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk URL: https://github.com/apache/spark/pull/50122
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
dongjoon-hyun commented on PR #50122: URL: https://github.com/apache/spark/pull/50122#issuecomment-2718823408 Thank you, @attilapiros and @Ngone51 . BTW, could you make two backporting PRs to branch-4.0 and branch-3.5 to make sure all CIs pass there, @attilapiros ?
Re: [PR] [SPARK-51490] Support `iOS`, `watchOS`, and `tvOS` [spark-connect-swift]
dongjoon-hyun closed pull request #13: [SPARK-51490] Support `iOS`, `watchOS`, and `tvOS` URL: https://github.com/apache/spark-connect-swift/pull/13
Re: [PR] [SPARK-51488][SQL] Support the TIME keyword as a data type [spark]
dongjoon-hyun closed pull request #50250: [SPARK-51488][SQL] Support the TIME keyword as a data type URL: https://github.com/apache/spark/pull/50250
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
dongjoon-hyun commented on PR #50122: URL: https://github.com/apache/spark/pull/50122#issuecomment-2718824584 Also cc @mridulm .
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
deniskuzZ commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991644845

## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:

@@ -1210,7 +1210,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
       .zip(parts)
       .map { case (k, v) =>
         if (v == "NULL") {
-          s"$k=${ConfVars.DEFAULTPARTITIONNAME.defaultStrVal}"
+          s"$k=${ConfVars.DEFAULTPARTITIONNAME.getDefaultVal}"

Review Comment: I think you should use `org.apache.hadoop.hive.conf.HiveConf.ConfVars.DEFAULT_PARTITION_NAME`
[PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z`… [spark-connect-swift]
dongjoon-hyun opened a new pull request, #14: URL: https://github.com/apache/spark-connect-swift/pull/14

… version

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-51492][SS] FileStreamSource: Avoid expensive file concatenation if not needed [spark]
siying opened a new pull request, #50257: URL: https://github.com/apache/spark/pull/50257

### What changes were proposed in this pull request?
In two places where constructing a log line can be expensive, avoid the logTrace() call as a whole.

### Why are the changes needed?
In this statement: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L402 `files.mkString("\n\t")` can be really expensive if there are many files. It could even cause OOM.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Use existing CI.

### Was this patch authored or co-authored using generative AI tooling?
No.
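The pattern behind the fix is simple enough to sketch. Below is a minimal illustration, not the actual patch, using a plain slf4j logger for self-containedness: guarding the call site means the `mkString` concatenation never runs at all unless TRACE logging is actually enabled.

```scala
import org.slf4j.LoggerFactory

object FileListLogging {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical helper: skipping the whole trace call (not just the
  // message argument) means the potentially huge string is never built
  // on the common path where TRACE is disabled.
  def traceBatch(files: Seq[String]): Unit = {
    if (log.isTraceEnabled) {
      log.trace(s"Processing ${files.size} files:\n\t${files.mkString("\n\t")}")
    }
  }
}
```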
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
deniskuzZ commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991644845

## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:

@@ -1210,7 +1210,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
       .zip(parts)
       .map { case (k, v) =>
         if (v == "NULL") {
-          s"$k=${ConfVars.DEFAULTPARTITIONNAME.defaultStrVal}"
+          s"$k=${ConfVars.DEFAULTPARTITIONNAME.getDefaultVal}"

Review Comment: I think you should use `org.apache.hadoop.hive.conf.HiveConf.DEFAULT_PARTITION_NAME`
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
dongjoon-hyun commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718898426 Could you review this PR when you have some time, @attilapiros ?
[PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
zhengruifeng opened a new pull request, #50255: URL: https://github.com/apache/spark/pull/50255

### What changes were proposed in this pull request?
Refresh testing images for pyarrow 19.

### Why are the changes needed?
To test against the latest pyarrow.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI:
- `lint` and `python-311` will be tested in the PR builder
- for other images, we will monitor the daily builders

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
deniskuzZ commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991860382

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
     // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
     // and the CommandProcessorFactory.clean function removed.
     driver.getClass.getMethod("close").invoke(driver)
-    if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {

Review Comment: 2.x is ancient and should no longer be supported
Re: [PR] [SPARK-51338][INFRA] Add automated CI build for `connect-examples` [spark]
LuciferYang commented on code in PR #50187: URL: https://github.com/apache/spark/pull/50187#discussion_r1991901327

## connect-examples/server-library-example/pom.xml:

@@ -36,7 +36,8 @@
     UTF-8
     2.13
     2.13.15
-    3.25.4
-    4.0.0-preview2
+    4.29.3
+    4.1.0-SNAPSHOT
+    33.4.0-jre

Review Comment: If feasible, it's certainly ok. However, I have a few questions regarding this:

1. How should the version numbers of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.
2. During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the `build` of this project?
3. Since it aims to be an independent project, why don't we choose to maintain this examples project in a separate branch (no Spark code whatsoever), or even create a separate repository like `spark-connect-examples`? If it is an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?
Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]
HyukjinKwon closed pull request #50254: [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json URL: https://github.com/apache/spark/pull/50254
[PR] [MINOR] Add compatibility badges [spark-connect-swift]
dongjoon-hyun opened a new pull request, #12: URL: https://github.com/apache/spark-connect-swift/pull/12

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]
HyukjinKwon commented on PR #50254: URL: https://github.com/apache/spark/pull/50254#issuecomment-2718496912 Merged to master.
Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]
wForget commented on PR #50245: URL: https://github.com/apache/spark/pull/50245#issuecomment-2716878368 @viirya @dongjoon-hyun @kazuyukitanimura could you please take a look?
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
LuciferYang commented on code in PR #50255: URL: https://github.com/apache/spark/pull/50255#discussion_r1991919622

## dev/spark-test-image/python-313/Dockerfile:

@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
     && rm -rf /var/lib/apt/lists/*

-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment: And
https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/infra/Dockerfile#L97
https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/infra/Dockerfile#L151
Re: [PR] [SPARK-51472] Add gRPC `SparkConnectClient` actor [spark-connect-swift]
dongjoon-hyun closed pull request #7: [SPARK-51472] Add gRPC `SparkConnectClient` actor URL: https://github.com/apache/spark-connect-swift/pull/7
Re: [PR] [SPARK-51481] Add `RuntimeConf` actor [spark-connect-swift]
dongjoon-hyun commented on PR #9: URL: https://github.com/apache/spark-connect-swift/pull/9#issuecomment-2716593313 Please wait for a while until I finish the first migration~ `4.1.0` will be replaced properly in a week.
Re: [PR] [MINOR] Add compatibility badges [spark-connect-swift]
dongjoon-hyun commented on PR #12: URL: https://github.com/apache/spark-connect-swift/pull/12#issuecomment-2718610578 Thank you! Merged to main.
Re: [PR] [MINOR] Add compatibility badges [spark-connect-swift]
dongjoon-hyun closed pull request #12: [MINOR] Add compatibility badges URL: https://github.com/apache/spark-connect-swift/pull/12
Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]
zhengruifeng commented on code in PR #50109: URL: https://github.com/apache/spark/pull/50109#discussion_r1991972859

## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ShowProceduresCommand.scala:

@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.types.StringType
+
+/**
+ * A command for users to get procedures.
+ * If a namespace is not given, the current namespace will be used.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW TABLES [(IN|FROM) namespace]]
+ * }}}
+ */

Review Comment:
```suggestion
 *   SHOW PROCEDURES [(IN|FROM) namespace]]
```

## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ShowProceduresCommand.scala:

+case class ShowProceduresCommand(
+    child: LogicalPlan,
+    override val output: Seq[Attribute] = Seq(
+      AttributeReference("namespace", StringType, nullable = false)(),
+      AttributeReference("procedure_name", StringType, nullable = false)()
+    )) extends UnaryRunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    child match {
+      case ResolvedNamespace(catalog, ns, _) =>
+        val procedureCatalog = catalog.asProcedureCatalog
+        val procedures = procedureCatalog.listProcedures(ns.toArray)
+        procedures.toSeq.map(p => Row(p.namespace().quoted, p.name()))

Review Comment: if mismatch, do we throw a proper exception other than `scala.MatchError`?
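For background on the `scala.MatchError` question above: a `match` with no fallback case throws `scala.MatchError` at runtime when the scrutinee matches no pattern. A self-contained toy sketch of the usual remedy, with names invented for illustration:

```scala
// Toy illustration of the pitfall raised in the review: without the
// fallback case, run(Unresolved("x")) would throw scala.MatchError.
sealed trait Plan
case class Resolved(name: String) extends Plan
case class Unresolved(name: String) extends Plan

def run(plan: Plan): Seq[String] = plan match {
  case Resolved(name) => Seq(name)
  case other =>
    // Surface a meaningful error instead of a bare scala.MatchError.
    throw new IllegalStateException(s"Unexpected plan node: $other")
}
```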
Re: [PR] [SPARK-51481] Add `RuntimeConf` actor [spark-connect-swift]
dongjoon-hyun commented on PR #9: URL: https://github.com/apache/spark-connect-swift/pull/9#issuecomment-2718880137 To @yaooqinn , I applied the new `version` scheme for all.

> By the way, can we unbind the version tracker of this project from the Spark main repo? It might be inconvenient for the RM of v4.1.0 to generate the release note.
[PR] [SPARK-51491][PYTHON] Simplify boxplot with subquery APIs [spark]
zhengruifeng opened a new pull request, #50258: URL: https://github.com/apache/spark/pull/50258

### What changes were proposed in this pull request?
Simplify boxplot with subquery APIs.

### Why are the changes needed?
1. To make the code simpler.
2. In my experience, a subquery is faster than a join in this case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51473][ML][CONNECT] ML transformed dataframe keep a reference to the model [spark]
zhengruifeng commented on code in PR #50199: URL: https://github.com/apache/spark/pull/50199#discussion_r1992191679

## python/pyspark/ml/util.py:

@@ -185,29 +185,40 @@
     def wrapped(self: "JavaWrapper", dataset: "ConnectDataFrame") -> Any:
         assert isinstance(self._java_obj, str)
         params = serialize_ml_params(self, session.client)
-        return ConnectDataFrame(
-            TransformerRelation(
-                child=dataset._plan, name=self._java_obj, ml_params=params, is_model=True
-            ),
-            session,
+        plan = TransformerRelation(
+            child=dataset._plan,
+            name=self._java_obj,
+            ml_params=params,
+            is_model=True,
         )
     elif isinstance(self, Transformer):
         from pyspark.ml.connect.proto import TransformerRelation

         assert isinstance(self._java_obj, str)
         params = serialize_ml_params(self, session.client)
-        return ConnectDataFrame(
-            TransformerRelation(
-                child=dataset._plan,
-                name=self._java_obj,
-                ml_params=params,
-                uid=self.uid,
-                is_model=False,
-            ),
-            session,
+        plan = TransformerRelation(

Review Comment: `ConnectDataFrame` is to build the Connect DF; it is still used at the end of this method.
Re: [PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]
dongjoon-hyun commented on PR #50249: URL: https://github.com/apache/spark/pull/50249#issuecomment-2718779093 Merged to master. Thank you, @LuciferYang and @beliefer .
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
attilapiros commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718970521 I see this connect-swift-x.y.z is already used in the tickets (screenshot): https://github.com/user-attachments/assets/6f7088a6-4f08-4a2a-b44e-d7d30be79eff
Re: [PR] [SPARK-51474][SQL] Don't insert redundant ColumnarToRowExec for node supporting both columnar and row output [spark]
viirya closed pull request #50239: [SPARK-51474][SQL] Don't insert redundant ColumnarToRowExec for node supporting both columnar and row output URL: https://github.com/apache/spark/pull/50239
Re: [PR] [SPARK-51474][SQL] Don't insert redundant ColumnarToRowExec for node supporting both columnar and row output [spark]
viirya commented on PR #50239: URL: https://github.com/apache/spark/pull/50239#issuecomment-2717957833 Thank you @dongjoon-hyun
Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]
LuciferYang commented on code in PR #50236: URL: https://github.com/apache/spark/pull/50236#discussion_r1991323757

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:

@@ -1337,6 +1350,35 @@ case class Cast(
     }
   }

+  private[this] def castToTimeCode(
+      from: DataType,
+      ctx: CodegenContext): CastFunction = {
+    from match {
+      case _: StringType =>
+        val longOpt = ctx.freshVariable("longOpt", classOf[Option[Long]])
+        (c, evPrim, evNull) =>
+          if (ansiEnabled) {
+            val errorContext = getContextOrNullCode(ctx)
+            code"""
+              $evPrim = $dateTimeUtilsCls.stringToTimeAnsi($c, $errorContext);
+            """
+          } else {
+            code"""
+              scala.Option $longOpt =
+                org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c);
+              if ($longOpt.isDefined()) {
+                $evPrim = ((Long) $longOpt.get()).longValue();
+              } else {
+                $evNull = true;
+              }
+            """
+          }
+
+      case _ =>
+        (_, evPrim, evNull) => code"$evNull = true;"

Review Comment (on the `DateTimeUtils.stringToTime` call): Can also use `$dateTimeUtilsCls.stringToTime($c);` ?

Review Comment (on the `case _ =>` fallback):
```suggestion
        (_, _, evNull) => code"$evNull = true;"
```
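For readers following the thread, the two codegen branches above encode the runtime behavior of SPARK-51482: in ANSI mode an unparsable string raises an error carrying the query context, otherwise the cast yields NULL. A rough usage sketch, assuming a running `spark` session with the feature merged; the exact output formatting is not guaranteed:

```scala
// Sketch only: assumes the TIME type and string-to-time cast as
// described in this thread (SPARK-51482 / SPARK-51488).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('12:34:56' AS TIME)").show()    // a valid time value
spark.sql("SELECT CAST('not a time' AS TIME)").show()  // NULL in non-ANSI mode

spark.conf.set("spark.sql.ansi.enabled", "true")
// In ANSI mode the same invalid cast throws instead of returning NULL.
spark.sql("SELECT CAST('not a time' AS TIME)").show()
```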
Re: [PR] [SPARK-50005][SQL] Enhance method verifyNotReadPath to identify subqueries hidden in the filter conditions. [spark]
xunxunmimi5577 commented on PR #48640: URL: https://github.com/apache/spark/pull/48640#issuecomment-2717698592 @panbingkun Could you please help review this?
[PR] [WIP][SQL] Support the TIME keyword as a data type [spark]
MaxGekk opened a new pull request, #50250: URL: https://github.com/apache/spark/pull/50250

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DataTypeParserSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]
MaxGekk commented on PR #50236: URL: https://github.com/apache/spark/pull/50236#issuecomment-2718536341 Merging to master. Thank you, @LuciferYang @beliefer for review.
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
zhengruifeng commented on code in PR #50255: URL: https://github.com/apache/spark/pull/50255#discussion_r1991952614

## dev/spark-test-image/python-313/Dockerfile:

@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
     && rm -rf /var/lib/apt/lists/*

-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment: `dev/infra/Dockerfile` is only used for Spark 3.5; I think we can leave it alone.
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
zhengruifeng commented on PR #50255: URL: https://github.com/apache/spark/pull/50255#issuecomment-2718596875

> Will `dev/create-release/spark-rm/Dockerfile` be fixed in a separate pull request?

I think we can update/sync `dev/create-release/spark-rm/Dockerfile` just before each release. WDYT?
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
LuciferYang commented on code in PR #50255: URL: https://github.com/apache/spark/pull/50255#discussion_r1991916210

## dev/spark-test-image/python-313/Dockerfile:

@@ -67,7 +67,7 @@ RUN apt-get update && apt-get install -y \
    && rm -rf /var/lib/apt/lists/*

-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=19.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
 ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"

Review Comment:
https://github.com/apache/spark/blob/d87851f8a941b74e2cd198f7e7c4a339061f25e4/dev/spark-test-image/python-313/Dockerfile#L78
Here is another `pyarrow>=18.0.0`
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
simhadri-g commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991865569

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
     // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
     // and the CommandProcessorFactory.clean function removed.
     driver.getClass.getMethod("close").invoke(driver)
-    if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {

Review Comment: Agreed. Hive 2.x and 3.x are EoL. We should move completely to Hive 4+. Thanks!
Re: [PR] [SPARK-51097][SS] Re-introduce RocksDB state store's last uploaded snapshot version instance metrics [spark]
zecookiez commented on PR #50195: URL: https://github.com/apache/spark/pull/50195#issuecomment-2718664791 @HeartSaVioR can you take a quick look at this? Thanks :)
Re: [PR] [WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics [spark]
zecookiez closed pull request #50157: [WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics URL: https://github.com/apache/spark/pull/50157
Re: [PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]
LuciferYang commented on PR #50251: URL: https://github.com/apache/spark/pull/50251#issuecomment-2718508143 All checks passed.
Re: [PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]
LuciferYang closed pull request #50251: [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` URL: https://github.com/apache/spark/pull/50251
Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]
MaxGekk closed pull request #50236: [SPARK-51482][SQL] Support cast from string to time URL: https://github.com/apache/spark/pull/50236
[PR] [SPARK-51489][SQL] Represent SQL Script in Spark UI [spark]
dusantism-db opened a new pull request, #50256: URL: https://github.com/apache/spark/pull/50256

### What changes were proposed in this pull request?
Initial representation of SQL Scripts in the Spark UI. This PR introduces a "SQL Script ID" column in the All Executions table in the `SQL / Dataframe` tab of the Spark UI. It also introduces a SQL Script ID label on the single SQL Execution page, which is only present if the SQL Execution is executed by a script.

To achieve this, this PR also introduces the concept of a SQL Script ID, which is propagated throughout the script execution. A `Dataset.ofRows` overload which supports script id propagation is introduced.

Screenshots:
- All Executions table: https://github.com/user-attachments/assets/18f5cca9-5748-47f8-93dc-5d7ef6800e7d
- Single Execution page: https://github.com/user-attachments/assets/8e8c0f7d-09f8-40ce-bf11-abb28e36aa95
- Non-script Single Execution page remains the same: https://github.com/user-attachments/assets/123c60ec-a00b-424b-8190-54976b355387

### Why are the changes needed?
Currently, there is no way to debug / view SQL scripts in the Spark UI.

### Does this PR introduce _any_ user-facing change?
Yes, the Spark UI is updated to include SQL Script data.

### How was this patch tested?
Existing unit tests, manual UI validation.

### Was this patch authored or co-authored using generative AI tooling?
No.
[PR] [SPARK-51490] Support iOS, watchOS, and tvOS [spark-connect-swift]
dongjoon-hyun opened a new pull request, #13: URL: https://github.com/apache/spark-connect-swift/pull/13

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]
LuciferYang commented on code in PR #50232: URL: https://github.com/apache/spark/pull/50232#discussion_r1990820088

## sql/hive/src/main/java/org/apache/hadoop/hive/ql/exec/HiveFunctionRegistryUtils.java:

@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.PrimitiveGrouping;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.typeinfo.*;
+
+import java.lang.reflect.Method;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.spark.internal.SparkLogger;
+import org.apache.spark.internal.SparkLoggerFactory;
+
+/**
+ * Copy some methods from {@link org.apache.hadoop.hive.ql.exec.FunctionRegistry}
+ * to avoid initializing Hive built-in UDFs.
+ *
+ * The code is based on Hive 2.3.10.
+ */
+public class HiveFunctionRegistryUtils {
+
+  public static final SparkLogger LOG =
+      SparkLoggerFactory.getLogger(HiveFunctionRegistryUtils.class);

Review Comment: nit: indentation
Re: [PR] [SPARK-51483] Add `SparkSession` and `DataFrame` actors [spark-connect-swift]
dongjoon-hyun commented on PR #10: URL: https://github.com/apache/spark-connect-swift/pull/10#issuecomment-2716686412 According to the review comment, I didn't mention `4.0.0 RC2` in `README.md` yet. - https://github.com/apache/spark-connect-swift/pull/4#discussion_r1988302867 For now, this is supposed to support Apache Spark 4.0.0+ only.
Re: [PR] [SPARK-51469][SQL] Improve MapKeyDedupPolicy so that avoid calling toString [spark]
beliefer commented on PR #50235: URL: https://github.com/apache/spark/pull/50235#issuecomment-2716796972 ping @dongjoon-hyun @yaooqinn @LuciferYang
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
dongjoon-hyun commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719042611 Thank you, @attilapiros . This PR aims to filter the versions when it chooses `Fixed Version`.
Re: [PR] [SPARK-51358] [SS] Introduce snapshot upload lag detection through StateStoreCoordinator [spark]
zecookiez commented on code in PR #50123: URL: https://github.com/apache/spark/pull/50123#discussion_r1992241770

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:

```diff
@@ -38,9 +38,20 @@ import org.apache.spark.sql.types.StructType
 import org.apache.spark.unsafe.Platform
 import org.apache.spark.util.{NonFateSharingCache, Utils}
 
+/**
+ * Trait representing events reported from a RocksDB instance.
+ *
+ * The internal RocksDB instance can use a provider with a `RocksDBEventListener` reference to
+ * report specific events like snapshot uploads. This should only be used to report back to the
+ * coordinator for metrics and monitoring purposes.
+ */
+trait RocksDBEventListener {
+  def reportSnapshotUploaded(version: Long): Unit
+}
+
 private[sql] class RocksDBStateStoreProvider
   extends StateStoreProvider with Logging with Closeable
-  with SupportsFineGrainedReplay {
+  with SupportsFineGrainedReplay with RocksDBEventListener {
```

Review Comment: I'm not fully sure how we can do this with a new class / object. The problem that the trait definition solves is that we only have the state store ID and query ID saved in the provider. A separate event listener object wouldn't be able to access these unless we want to pass those into the event listener object as well.
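For illustration, here is a minimal sketch of the alternative design being debated: a standalone listener that must receive the identifiers explicitly. This is hypothetical code, not the PR's implementation; the field names (`stateStoreId`, `queryRunId`) and the coordinator callback are assumptions.

```scala
// Hypothetical sketch only: a separate listener instead of mixing the trait
// into the provider. The IDs the provider already holds must now be threaded
// through the constructor, which is the overhead zecookiez points out.
class SnapshotUploadListener(
    stateStoreId: String,  // assumed: the provider's state store identifier
    queryRunId: String)    // assumed: the streaming query's run identifier
  extends RocksDBEventListener {

  override def reportSnapshotUploaded(version: Long): Unit = {
    // Forward (stateStoreId, queryRunId, version) to the StateStoreCoordinator
    // for lag-detection metrics; the exact RPC is out of scope for this sketch.
  }
}
```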
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
attilapiros commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2718979814 Let me do a quick test...
Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]
dongjoon-hyun commented on PR #50255: URL: https://github.com/apache/spark/pull/50255#issuecomment-2719073678 I switched the issue type from `Improvement` to `Test`. Feel free to switch back if it looks improper to you, @zhengruifeng . Code-wise, I'm +1 if this doesn't cause any outage in `master` and `branch-4.0` (commit and Daily builder).
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
dongjoon-hyun commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719045920 To test, you can try to merge this PR (SPARK-51493) with this script. (screenshot: https://github.com/user-attachments/assets/eed46cc9-cbea-44ed-909a-10d640e3a678)
Re: [PR] [SPARK-51473][ML][CONNECT] ML transformed dataframe keep a reference to the model [spark]
zhengruifeng commented on code in PR #50199: URL: https://github.com/apache/spark/pull/50199#discussion_r1992274916

## python/pyspark/ml/util.py:

```diff
@@ -185,29 +185,40 @@ def wrapped(self: "JavaWrapper", dataset: "ConnectDataFrame") -> Any:
             assert isinstance(self._java_obj, str)
 
             params = serialize_ml_params(self, session.client)
-            return ConnectDataFrame(
-                TransformerRelation(
-                    child=dataset._plan, name=self._java_obj, ml_params=params, is_model=True
-                ),
-                session,
+            plan = TransformerRelation(
+                child=dataset._plan,
+                name=self._java_obj,
+                ml_params=params,
+                is_model=True,
             )
         elif isinstance(self, Transformer):
             from pyspark.ml.connect.proto import TransformerRelation
 
             assert isinstance(self._java_obj, str)
             params = serialize_ml_params(self, session.client)
-            return ConnectDataFrame(
-                TransformerRelation(
-                    child=dataset._plan,
-                    name=self._java_obj,
-                    ml_params=params,
-                    uid=self.uid,
-                    is_model=False,
-                ),
-                session,
+            plan = TransformerRelation(
+                child=dataset._plan,
+                name=self._java_obj,
+                ml_params=params,
+                uid=self.uid,
+                is_model=False,
             )
+
         else:
             raise RuntimeError(f"Unsupported {self}")
+
+        # To delay the GC of the model, keep a reference to the source transformer
+        # in the transformed dataframe and all its descendants.
+        # For this case:
+        #
+        #     def fit_transform(df):
+        #         model = estimator.fit(df)
+        #         return model.transform(df)
+        #
+        #     output = fit_transform(df)
+        #
+        plan.__source_transformer__ = self  # type: ignore[attr-defined]
```

Review Comment: we should
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
attilapiros commented on PR #50122: URL: https://github.com/apache/spark/pull/50122#issuecomment-2719130827 Backport to branch-4.0: https://github.com/apache/spark/pull/50259
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
attilapiros commented on PR #50122: URL: https://github.com/apache/spark/pull/50122#issuecomment-2719136015 Backport to branch-3.5: https://github.com/apache/spark/pull/50260
[PR] [SPARK-43221][CORE][4.0] Host local block fetching should use a block status of a block stored on disk [spark]
attilapiros opened a new pull request, #50259: URL: https://github.com/apache/spark/pull/50259

**This is a backport to branch-4.0 from master.**

Thanks to @yorksity, who reported this error and even provided a PR for it. This solution is very different from https://github.com/apache/spark/pull/40883 as `BlockManagerMasterEndpoint#getLocationsAndStatus()` needed some refactoring.

### What changes were proposed in this pull request?

This PR fixes an error which can manifest in the following exception:

```
25/02/20 09:58:31 ERROR util.Utils: [Executor task launch worker for task 61.0 in stage 67.0 (TID 9391)]: Exception encountered
java.lang.ArrayIndexOutOfBoundsException: 0
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:185) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at scala.collection.immutable.List.foreach(List.scala:431) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:171) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
```

The PR changes `BlockManagerMasterEndpoint#getLocationsAndStatus()`. This function returns an optional `BlockLocationsAndStatus`, which consists of 3 parts:

- `locations`: all the locations where the block can be found (as a sequence of block manager IDs)
- `status`: one block status
- `localDirs`: optional directory paths which can be used to read the block if it is found on the disk of an executor running on the same host

The block (either an RDD block, shuffle block or torrent block) can be stored in many executors with different storage levels: disk or memory. This PR changes how the block status and the block manager ID for the `localDirs` are found, to guarantee they belong together.

### Why are the changes needed?

Before this PR, `BlockManagerMasterEndpoint#getLocationsAndStatus()` searched for the block status (`status`) and the `localDirs` separately. The block status was actually computed from the very first location where the block can be found. This way it could easily happen that the block status represented an in-memory block (where the disk size is 0, as it is stored in memory) while the `localDirs` was filled out based on a host-local block instance stored on disk. This situation can be very frequent, but it only causes problems (exceptions as above) when encryption is on (`spark.io.encryption.enabled=true`), as for a non-encrypted block the whole file containing the block is read, see https://github.com/apache/spark/blob/branch-3.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1244

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Host local block fetching was already covered by some existing unit tests, but a new unit test is provided for this exact case: "SPARK-43221: Host local block fetching should use a block status with disk size".
The number of block managers and the order of the blocks were chosen after some experimentation, as the block status order depends on a `HashSet`, see:

```
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
```

This test was executed with the old code too, to validate that the issue is reproduced:

```
BlockManagerSuite:
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
- SPARK-43221: Host local block fetching should use a block status with disk size *** FAILED ***
  0 was not greater than 0 The block size must be greater than 0 for a nonempty block! (BlockManagerSuite.scala:491)
Run completed in 6 seconds, 705 milliseconds.
Total number of tests run: 1
Suites: completed 1, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

### Was this patch authored or co-authored using generative AI tooling?

No.

(cherry picked from commit 997e599738ebc3b1e1df6a313cd222bb70ef481d)
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
dongjoon-hyun commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719150536 Thank you so much, @attilapiros !
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
simhadri-g commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1990961751

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```diff
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
       // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
       // and the CommandProcessorFactory.clean function removed.
       driver.getClass.getMethod("close").invoke(driver)
-      if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {
```

Review Comment: This would break backward compatibility when Spark tries to connect with older versions of Hive, right?
Re: [PR] [SPARK-43221][CORE] Host local block fetching should use a block status of a block stored on disk [spark]
Ngone51 commented on code in PR #50122: URL: https://github.com/apache/spark/pull/50122#discussion_r1990890014

## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:

```diff
@@ -862,31 +862,50 @@ class BlockManagerMasterEndpoint(
   private def getLocationsAndStatus(
       blockId: BlockId,
       requesterHost: String): Option[BlockLocationsAndStatus] = {
-    val locations = Option(blockLocations.get(blockId)).map(_.toSeq).getOrElse(Seq.empty)
-    val status = locations.headOption.flatMap { bmId =>
-      if (externalShuffleServiceRddFetchEnabled && bmId.port == externalShuffleServicePort) {
-        blockStatusByShuffleService.get(bmId).flatMap(m => m.get(blockId))
-      } else {
-        blockManagerInfo.get(bmId).flatMap(_.getStatus(blockId))
+    val allLocations = Option(blockLocations.get(blockId)).map(_.toSeq).getOrElse(Seq.empty)
+    val hostLocalLocations = allLocations.filter(bmId => bmId.host == requesterHost)
+
+    val blockStatusWithBlockManagerId: Option[(BlockStatus, BlockManagerId)] =
+      (if (externalShuffleServiceRddFetchEnabled) {
+        // if fetching RDD is enabled from the external shuffle service then first try to find
+        // the block in the external shuffle service of the same host
+        val location = hostLocalLocations.find(_.port == externalShuffleServicePort)
+        location
+          .flatMap(blockStatusByShuffleService.get(_).flatMap(_.get(blockId)))
+          .zip(location)
+      } else {
+        None
+      })
+      .orElse {
+        // if the block is not found via the external shuffle service trying to find it in the
+        // executors running on the same host and persisted on the disk
+        // using flatMap on iterators makes the transformation lazy
+        hostLocalLocations.iterator
+          .flatMap { bmId =>
+            blockManagerInfo.get(bmId).flatMap { blockInfo =>
+              blockInfo.getStatus(blockId).map((_, bmId))
+            }
+          }
+          .find(_._1.storageLevel.useDisk)
```

Review Comment: Do we enforce the disk first against the memory after this change? Looks like a behaviour change.
Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]
pan3793 commented on code in PR #50232: URL: https://github.com/apache/spark/pull/50232#discussion_r1990737841

## sql/hive-thriftserver/pom.xml:

```diff
@@ -148,16 +148,6 @@
       byte-buddy-agent
       test
-
```

Review Comment: Done by updating the `SparkBuild.scala`; it can be verified by

```
build/sbt -Phive-thriftserver hive-thriftserver/Test/dependencyTree | grep hive-llap
```

before

```
[info]   +-org.apache.hive:hive-llap-client:2.3.10
[info]   +-org.apache.hive:hive-llap-common:2.3.10
[info]   | +-org.apache.hive:hive-llap-client:2.3.10
[info]   | +-org.apache.hive:hive-llap-common:2.3.10
```

now the result is empty.
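For illustration, an sbt-level exclusion of the kind this comment describes might look like the sketch below. This is a hypothetical fragment, not the actual `SparkBuild.scala` change; the object name and settings wiring are assumptions.

```scala
// Hypothetical sketch: exclude the hive-llap artifacts from a module's
// dependency graph in SparkBuild.scala. The real PR may structure this differently.
import sbt._
import sbt.Keys._

object HiveThriftServerDeps {
  lazy val settings: Seq[Setting[_]] = Seq(
    excludeDependencies ++= Seq(
      ExclusionRule("org.apache.hive", "hive-llap-client"),
      ExclusionRule("org.apache.hive", "hive-llap-common")
    )
  )
}
```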
Re: [PR] [SPARK-51365][TESTS] Test maven + macos [spark]
LuciferYang closed pull request #50178: [SPARK-51365][TESTS] Test maven + macos URL: https://github.com/apache/spark/pull/50178
Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]
beliefer commented on code in PR #50245: URL: https://github.com/apache/spark/pull/50245#discussion_r1991031211

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:

```diff
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with Logging {
     }
 
     if (resKey.length == 1 && resValue.length == 1) {
-      val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-      val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-      val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-      val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-      Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+      // If the key and value expressions have not changed, we just check original map field.
+      // Otherwise, we construct a new map by adding transformations to the keys and values.
+      if (resKey.head == keyParam && resValue.head == valueParam) {
+        Some(
+          Alias(nullCheckedInput, expected.name)(
+            nonInheritableMetadataKeys =
+              Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))
```

Review Comment: @viirya Could you help me understand why the `CHAR_VARCHAR_TYPE_STRING_METADATA_KEY` is removed from the metadata?
Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]
beliefer commented on code in PR #50245: URL: https://github.com/apache/spark/pull/50245#discussion_r1990965674

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:

```diff
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with Logging {
     }
 
     if (resKey.length == 1 && resValue.length == 1) {
-      val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-      val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-      val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-      val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-      Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+      // If the key and value expressions have not changed, we just check original map field.
+      // Otherwise, we construct a new map by adding transformations to the keys and values.
+      if (resKey.head == keyParam && resValue.head == valueParam) {
+        Some(
+          Alias(nullCheckedInput, expected.name)(
+            nonInheritableMetadataKeys =
+              Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))
```

Review Comment: Could you explain why you want to remove the `CHAR_VARCHAR_TYPE_STRING_METADATA_KEY` from the metadata?
Re: [PR] [SPARK-49485][CORE] Fix speculative task hang bug due to remaining executor lay on same host [spark]
buska88 commented on code in PR #47949: URL: https://github.com/apache/spark/pull/47949#discussion_r1990965346

## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:

```diff
@@ -308,13 +328,59 @@ private[spark] class ExecutorAllocationManager(
       tasksPerExecutor).toInt
 
     val maxNeededWithSpeculationLocalityOffset =
-      if (tasksPerExecutor > 1 && maxNeeded == 1 && pendingSpeculative > 0) {
-        // If we have pending speculative tasks and only need a single executor, allocate one more
-        // to satisfy the locality requirements of speculation
-        maxNeeded + 1
-      } else {
-        maxNeeded
-      }
+      if (pendingSpeculative > 0 && maxNeeded <= numExecutorsTarget &&
+          executorMonitor.executorHostsCount == 1 && scheduler.isInstanceOf[TaskSchedulerImpl]) {
+        val now = clock.nanoTime()
+        if (excludeNodesTriggerTime == NOT_SET) {
+          excludeNodesTriggerTime = now + TimeUnit.MINUTES.toNanos(excludeNodeTriggerTimeoutMin)
+          logDebug(log"Current timestamp ${MDC(TIMESTAMP, now)}, " +
+            log"set excludeNodesTriggerTime to ${MDC(TIMESTAMP, excludeNodesTriggerTime)}")
+          maxNeeded
+        } else if (now < excludeNodesTriggerTime) {
+          maxNeeded
+        } else {
+          if (executorMonitor.hasAdjustMaxNeed) {
+            adjustMaxNeed
+          } else {
+            logDebug(log"${MDC(TIMESTAMP, now)} exceeds" +
+              log" ${MDC(TIMESTAMP, excludeNodesTriggerTime)}, start exclude node!")
+            val node = executorMonitor.getExecutorHostsName(0)
+            // check if current remaining host has attempts of speculative task
+            val speculativeTasksInfo = listener.getPendingSpeculativeTasksInfo()
+            if (scheduler.asInstanceOf[TaskSchedulerImpl].
+                speculativeTasksHasAttemptOnHost(node, speculativeTasksInfo)) {
+              // make sure new maxNeed exceeds numExecutorsTarget and allocate executor
+              adjustMaxNeed = numExecutorsTarget + 1
+              // If hasAdjustMaxNeed, use last adjust value as
+              // maxNeededWithSpeculationLocalityOffset in case of numExecutorsTarget keeps
+              // increasing during maxNumExecutorsNeededPerResourceProfile method called
+              if (scheduler.asInstanceOf[TaskSchedulerImpl].handleExcludeNodes(node)) {
+                logDebug(log"Exclude ${MDC(HOST, node)} for speculative, " +
+                  log"old maxNeeded: ${MDC(COUNT, maxNeeded)}, " +
+                  log"old numExecutorsTarget: ${MDC(COUNT, numExecutorsTarget)}, " +
+                  log"current executors count: ${MDC(COUNT, executorMonitor.executorCount)}")
+                excludeSpeculativeNodes.add(node)
+                executorMonitor.setAdjustMaxNeed(true)
+                adjustMaxNeed
+              } else {
+                logDebug(log"${MDC(HOST, node)} has been excluded for other reason")
+                maxNeeded
+              }
+            } else {
+              logDebug(log"No speculative task found on ${MDC(HOST, node)}")
+              maxNeeded
+            }
+          }
+        }
+      } else if (tasksPerExecutor > 1 && maxNeeded == 1 && pendingSpeculative > 0) {
+        resetExcludeNodesTriggerTime()
+        // If we have pending speculative tasks and only need a single executor, allocate one more
+        // to satisfy the locality requirements of speculation
+        maxNeeded + 1
+      } else {
+        resetExcludeNodesTriggerTime()
+        maxNeeded
+      }
```

Review Comment: Thanks for reviewing. I have noticed the review message today. What if we make this feature configurable? Users are more intolerant of being stuck than of wastage in some cases.
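Following that suggestion, a configurable guard might look like the sketch below. This is purely hypothetical: the flag name, doc text, and default are assumptions for illustration, not part of the PR.

```scala
import org.apache.spark.internal.config.ConfigBuilder

object SpeculationExclusionConf {
  // Hypothetical flag gating the node-exclusion behavior discussed above;
  // defaulting to false would preserve the current behavior.
  val EXCLUDE_REMAINING_NODE_FOR_SPECULATION =
    ConfigBuilder("spark.dynamicAllocation.excludeRemainingNodeForSpeculation.enabled")
      .doc("When true, the allocation manager may exclude the last remaining host " +
        "so that a pending speculative task attempt can run on a fresh executor.")
      .version("4.1.0")
      .booleanConf
      .createWithDefault(false)
}
```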
Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]
wForget commented on code in PR #50245: URL: https://github.com/apache/spark/pull/50245#discussion_r1990969615

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:

```diff
@@ -459,11 +459,28 @@ object TableOutputResolver extends SQLConfHelper with Logging {
     }
 
     if (resKey.length == 1 && resValue.length == 1) {
-      val keyFunc = LambdaFunction(resKey.head, Seq(keyParam))
-      val valueFunc = LambdaFunction(resValue.head, Seq(valueParam))
-      val newKeys = ArrayTransform(MapKeys(nullCheckedInput), keyFunc)
-      val newValues = ArrayTransform(MapValues(nullCheckedInput), valueFunc)
-      Some(Alias(MapFromArrays(newKeys, newValues), expected.name)())
+      // If the key and value expressions have not changed, we just check original map field.
+      // Otherwise, we construct a new map by adding transformations to the keys and values.
+      if (resKey.head == keyParam && resValue.head == valueParam) {
+        Some(
+          Alias(nullCheckedInput, expected.name)(
+            nonInheritableMetadataKeys =
+              Seq(CharVarcharUtils.CHAR_VARCHAR_TYPE_STRING_METADATA_KEY)))
```

Review Comment: I just want to be consistent with #47843
[PR] [SPARK-51484][SS] Remove unused function `private def newDFSFileName(String)` from `RocksDBFileManager` [spark]
LuciferYang opened a new pull request, #50249: URL: https://github.com/apache/spark/pull/50249

### What changes were proposed in this pull request?
This PR aims to remove the unused function `private def newDFSFileName(String)` from `RocksDBFileManager`. It is no longer used after SPARK-49770 (https://github.com/apache/spark/pull/47875).

### Why are the changes needed?
Code cleanup.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]
MaxGekk commented on PR #50236: URL: https://github.com/apache/spark/pull/50236#issuecomment-2717411710 @yaooqinn @LuciferYang @dongjoon-hyun @HyukjinKwon Could you take a look at the PR when you have time, please.
Re: [PR] [SPARK-51482][SQL] Support cast from string to time [spark]
MaxGekk commented on code in PR #50236: URL: https://github.com/apache/spark/pull/50236#discussion_r1991652407

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:

```diff
@@ -1337,6 +1350,35 @@ case class Cast(
     }
   }
 
+  private[this] def castToTimeCode(
+      from: DataType,
+      ctx: CodegenContext): CastFunction = {
+    from match {
+      case _: StringType =>
+        val longOpt = ctx.freshVariable("longOpt", classOf[Option[Long]])
+        (c, evPrim, evNull) =>
+          if (ansiEnabled) {
+            val errorContext = getContextOrNullCode(ctx)
+            code"""
+              $evPrim = $dateTimeUtilsCls.stringToTimeAnsi($c, $errorContext);
+            """
+          } else {
+            code"""
+              scala.Option $longOpt =
+                org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime($c);
+            """
+          }
```

Review Comment: Fixed, and changed other places too. Please, review the PR: https://github.com/apache/spark/pull/50251
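For context, the ANSI/non-ANSI split in that codegen can be illustrated with a small hedged example, assuming a Spark build that includes this string-to-TIME cast; the exact output formatting is illustrative only.

```scala
// Illustrative only: exercises the string -> TIME cast added by SPARK-51482.
// With ANSI mode off, a malformed input yields NULL; with ANSI mode on,
// the same cast would raise an error instead (via stringToTimeAnsi).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('12:34:56' AS TIME) AS ok, CAST('not a time' AS TIME) AS bad").show()
```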
Re: [PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]
HyukjinKwon commented on PR #50254: URL: https://github.com/apache/spark/pull/50254#issuecomment-2718324059 cc @LuciferYang
Re: [PR] [SPARK-51338][INFRA] Add automated CI build for `connect-examples` [spark]
vicennial commented on code in PR #50187: URL: https://github.com/apache/spark/pull/50187#discussion_r1991794909

## connect-examples/server-library-example/pom.xml:

```diff
@@ -36,7 +36,8 @@
     UTF-8
     2.13
     2.13.15
-    3.25.4
-    4.0.0-preview2
+    4.29.3
+    4.1.0-SNAPSHOT
+    33.4.0-jre
```

Review Comment: What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can satisfy both continuous build compatibility with Spark and be somewhat independent (modulo the dependency on the ASF snapshot repo). I'd like to avoid inheriting the parent pom, as that would lead to the project pulling in Spark's default shading rules, version definitions, etc. In this specific case, it wouldn't be favourable, as the project is intended to demonstrate the extension's development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).
[PR] [MINOR][PYTHON] Generate with a newline at the end of error-conditions.json [spark]
HyukjinKwon opened a new pull request, #50254: URL: https://github.com/apache/spark/pull/50254

### What changes were proposed in this pull request?
This PR proposes to generate with a newline at the end of `error-conditions.json`.

### Why are the changes needed?
To make the style consistent with other Python files.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests in CI.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [MINOR][DOC] Improve the documentation for configuration. [spark]
0xbadidea closed pull request #50247: [MINOR][DOC] Improve the documentation for configuration. URL: https://github.com/apache/spark/pull/50247
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
deniskuzZ commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991644845

## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:

```diff
@@ -1210,7 +1210,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
       .zip(parts)
       .map { case (k, v) =>
         if (v == "NULL") {
-          s"$k=${ConfVars.DEFAULTPARTITIONNAME.defaultStrVal}"
+          s"$k=${ConfVars.DEFAULTPARTITIONNAME.getDefaultVal}"
```

Review Comment: you should use `org.apache.hadoop.hive.conf.HiveConf.DEFAULT_PARTITION_NAME`
Re: [PR] [SPARK-50188][CONNECT][PYTHON] When the connect client starts, print the server's webUrl [spark]
panbingkun commented on PR #48720: URL: https://github.com/apache/spark/pull/48720#issuecomment-2716707617

> +1 for the idea. Could you rebase this once more, @panbingkun ?

Sorry, I have been struggling with a new job recently, and it has been updated! Thanks.
[PR] [MINOR][SQL] Reuse `dateTimeUtilsCls` in `Cast` [spark]
MaxGekk opened a new pull request, #50251: URL: https://github.com/apache/spark/pull/50251

### What changes were proposed in this pull request?
In the PR, I propose to re-use the method `dateTimeUtilsCls()` instead of `org.apache.spark.sql.catalyst.util.DateTimeUtils` in the `Cast` expression.

### Why are the changes needed?
1. To make the code consistent: in some places we use `dateTimeUtilsCls()`, but in others `org.apache.spark.sql.catalyst.util.DateTimeUtils`.
2. To improve code maintenance: `DateTimeUtils` can be moved to another package w/o changing the code generation in `Cast`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing GAs.

### Was this patch authored or co-authored using generative AI tooling?
No.
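To make the maintenance argument concrete, here is a small hedged sketch of the pattern (not the actual `Cast` code; Spark derives the real helper from the `DateTimeUtils` companion object, and the definition below is an assumption): generated Java snippets reference one helper that resolves the class name, so a package move touches a single definition.

```scala
// Sketch under assumptions: a single helper names the utility class, and all
// generated code strings go through it instead of hard-coding the package path.
object CodegenNames {
  def dateTimeUtilsCls(): String =
    "org.apache.spark.sql.catalyst.util.DateTimeUtils"
}

// Before: the package path is baked into every generated snippet.
val before = "org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTime(c);"

// After: moving DateTimeUtils only requires updating dateTimeUtilsCls().
val after = s"${CodegenNames.dateTimeUtilsCls()}.stringToTime(c);"
```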
[PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]
dongjoon-hyun opened a new pull request, #11: URL: https://github.com/apache/spark-connect-swift/pull/11

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]
aokolnychyi opened a new pull request, #50253: URL: https://github.com/apache/spark/pull/50253

### What changes were proposed in this pull request?
This PR adds DSv2 APIs for constraints as per SPIP [doc](https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo/).

### Why are the changes needed?
These changes are the first step for constraints support in Spark.

### Does this PR introduce _any_ user-facing change?
No, just interfaces.

### How was this patch tested?
This PR comes with tests.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]
HyukjinKwon commented on PR #50241: URL: https://github.com/apache/spark/pull/50241#issuecomment-2718308691 Merged to master.
Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]
HyukjinKwon commented on code in PR #50241: URL: https://github.com/apache/spark/pull/50241#discussion_r1991779317

## python/pyspark/errors/error-conditions.json:

```diff
@@ -1208,4 +1208,4 @@
       "Index must be non-zero."
     ]
   }
-}
+}
```

Review Comment: it is a generated file. Maybe we should fix the generation. I will make a PR.
Re: [PR] [MINOR][PYTHON] Reformat error classes [spark]
HyukjinKwon closed pull request #50241: [MINOR][PYTHON] Reformat error classes URL: https://github.com/apache/spark/pull/50241
[PR] [SPARK-51271][PYTHON] Filter serialization for Python Data Source filter pushdown [spark]
wengh opened a new pull request, #50252: URL: https://github.com/apache/spark/pull/50252

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-51476][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11) [spark]
HyukjinKwon closed pull request #50242: [SPARK-51476][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11) URL: https://github.com/apache/spark/pull/50242
Re: [PR] [SPARK-51467][UI] Make tables of the environment page filterable [spark]
yaooqinn commented on PR #50233: URL: https://github.com/apache/spark/pull/50233#issuecomment-2716542978

> The overloaded action issues might be considered later in an independent JIRA (if needed later).

Thank you @dongjoon-hyun. For potential UX improvements, I'd leave it for further decisions based on user feedback.

Merged to master.
Re: [PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]
dongjoon-hyun commented on PR #11: URL: https://github.com/apache/spark-connect-swift/pull/11#issuecomment-2718378823 Thank you, @viirya . Merged to main.
Re: [PR] [WIP][SPARK-51348][BUILD][SQL] Upgrade Hive to 4.0 [spark]
vrozov commented on code in PR #50213: URL: https://github.com/apache/spark/pull/50213#discussion_r1991823559

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```diff
@@ -885,20 +884,6 @@ private[hive] class HiveClientImpl(
       // Since HIVE-18238(Hive 3.0.0), the Driver.close function's return type changed
       // and the CommandProcessorFactory.clean function removed.
       driver.getClass.getMethod("close").invoke(driver)
-      if (version != hive.v3_0 && version != hive.v3_1 && version != hive.v4_0) {
```

Review Comment: I don't think it is a connection problem. There will be a problem when 2.x Hive jars are provided at runtime. It is not clear to me if it is still necessary to support such an option, granted that 2.x is EOL.
Re: [PR] [SPARK-51485] Add `How to use in your apps` section to `README.md` [spark-connect-swift]
dongjoon-hyun closed pull request #11: [SPARK-51485] Add `How to use in your apps` section to `README.md` URL: https://github.com/apache/spark-connect-swift/pull/11
Re: [PR] [SPARK-51476][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11) [spark]
HyukjinKwon commented on PR #50242: URL: https://github.com/apache/spark/pull/50242#issuecomment-2718310865 Merged to master.
Re: [PR] [SPARK-51493] Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version [spark-connect-swift]
dongjoon-hyun commented on PR #14: URL: https://github.com/apache/spark-connect-swift/pull/14#issuecomment-2719154602

Merged to main~ (I also used this new script to merge this PR.)

```
...
Would you like to update an associated JIRA? (y/N): y
Enter a JIRA id [SPARK-51493]:
=== JIRA SPARK-51493 ===
Summary     Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version
Assignee    None
Status      Open
Url         https://issues.apache.org/jira/browse/SPARK-51493
Affected    ['connect-swift-0.1.0']
Check if the JIRA information is as expected (y/N): y
JIRA is unassigned, choose assignee
[0] Dongjoon Hyun (Reporter)
Enter number of user, or userid, to assign to (blank to leave unassigned): 0
Enter comma-separated fix version(s) [connect-swift-0.1.0]:
=== JIRA SPARK-51493 ===
Summary     Refine `merge_spark_pr.py` to use `connect-swift-x.y.z` version
Assignee    Dongjoon Hyun
Status      Resolved
Url         https://issues.apache.org/jira/browse/SPARK-51493
Affected    ['connect-swift-0.1.0']
Fixed       ['connect-swift-0.1.0']
Successfully resolved SPARK-51493 with fixVersions=['connect-swift-0.1.0']!
```
Re: [PR] [SPARK-51358] [SS] Introduce snapshot upload lag detection through StateStoreCoordinator [spark]
zecookiez commented on code in PR #50123: URL: https://github.com/apache/spark/pull/50123#discussion_r1992391021

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala:

```diff
@@ -65,6 +65,7 @@ case object StoreTaskCompletionListener extends RocksDBOpType("store_task_comple
  * @param localRootDir Root directory in local disk that is used to working and checkpointing dirs
  * @param hadoopConf Hadoop configuration for talking to the remote file system
  * @param loggingId Id that will be prepended in logs for isolating concurrent RocksDBs
+ * @param providerListener A reference to the state store provider for event callback reporting
```

Review Comment: Hmm okay, will change that part to "The parent RocksDBStateStoreProvider object" to help make it more clear :+1:

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala:

```diff
@@ -38,9 +38,20 @@ import org.apache.spark.sql.types.StructType
 import org.apache.spark.unsafe.Platform
 import org.apache.spark.util.{NonFateSharingCache, Utils}
 
+/**
+ * Trait representing events reported from a RocksDB instance.
+ *
+ * The internal RocksDB instance can use a provider with a `RocksDBEventListener` reference to
```

Review Comment: Good idea, that makes more sense than what I had. Thanks!
[PR] [SPARK-43221][CORE][3.5] Host local block fetching should use a block status of a block stored on disk [spark]
attilapiros opened a new pull request, #50260: URL: https://github.com/apache/spark/pull/50260

**This is a backport to branch-3.5 from master.**

Thanks to @yorksity, who reported this error and even provided a PR for it. This solution is very different from https://github.com/apache/spark/pull/40883 as `BlockManagerMasterEndpoint#getLocationsAndStatus()` needed some refactoring.

### What changes were proposed in this pull request?

This PR fixes an error which can manifest in the following exception:

```
25/02/20 09:58:31 ERROR util.Utils: [Executor task launch worker for task 61.0 in stage 67.0 (TID 9391)]: Exception encountered
java.lang.ArrayIndexOutOfBoundsException: 0
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:185) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at scala.collection.immutable.List.foreach(List.scala:431) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:171) ~[spark-core_2.12-3.3.2.3.3.7190.5-2.jar:3.3.2.3.3.7190.5-2]
```

The PR changes `BlockManagerMasterEndpoint#getLocationsAndStatus()`. This function returns an optional `BlockLocationsAndStatus`, which consists of 3 parts:

- `locations`: all the locations where the block can be found (as a sequence of block manager IDs)
- `status`: one block status
- `localDirs`: optional directory paths which can be used to read the block if it is found on the disk of an executor running on the same host

The block (either an RDD block, shuffle block or torrent block) can be stored in many executors with different storage levels: disk or memory. This PR changes how the block status and the block manager ID for the `localDirs` are found, to guarantee they belong together.

### Why are the changes needed?

Before this PR, `BlockManagerMasterEndpoint#getLocationsAndStatus()` searched for the block status (`status`) and the `localDirs` separately. The block status was actually computed from the very first location where the block can be found. This way it could easily happen that the block status represented an in-memory block (where the disk size is 0, as it is stored in memory) while the `localDirs` was filled out based on a host-local block instance stored on disk. This situation can be very frequent, but it only causes problems (exceptions as above) when encryption is on (`spark.io.encryption.enabled=true`), as for a non-encrypted block the whole file containing the block is read, see https://github.com/apache/spark/blob/branch-3.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1244

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Host local block fetching was already covered by some existing unit tests, but a new unit test is provided for this exact case: "SPARK-43221: Host local block fetching should use a block status with disk size".
The number of block managers and the order of the blocks were chosen after some experimentation, as the block status order depends on a `HashSet`, see:

```
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
```

This test was executed with the old code too, to validate that the issue is reproduced:

```
BlockManagerSuite:
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
- SPARK-43221: Host local block fetching should use a block status with disk size *** FAILED ***
  0 was not greater than 0 The block size must be greater than 0 for a nonempty block! (BlockManagerSuite.scala:491)
Run completed in 6 seconds, 705 milliseconds.
Total number of tests run: 1
Suites: completed 1, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

### Was this patch authored or co-authored using generative AI tooling?

No.

(cherry picked from commit 997e599738ebc3b1e1df6a313cd222bb70ef481d)