Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #46873: URL: https://github.com/apache/spark/pull/46873#issuecomment-2230167466 Thank you for pinging me, @LuciferYang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] [SPARK-48890][CORE][SS] Add Structured Streaming related fields to log4j ThreadContext [spark]

2024-07-16 Thread via GitHub
WweiL commented on code in PR #47340: URL: https://github.com/apache/spark/pull/47340#discussion_r1678863246 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: ## @@ -287,6 +290,11 @@ abstract class StreamExecution( sparkSession.spa
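For context on the mechanism named in the subject: log4j 2.x keeps a per-thread key/value map (`ThreadContext`) that structured layouts can emit with every log line. A minimal Scala sketch of that mechanism, with illustrative field names that are assumptions rather than the fields added in this PR:
```scala
import org.apache.logging.log4j.ThreadContext

// Attach per-query fields to the current thread's logging context so a
// structured layout (e.g. JSON) includes them in every log line.
ThreadContext.put("query_id", "a1b2c3")   // illustrative field name
ThreadContext.put("batch_id", "42")       // illustrative field name
try {
  // ... run the micro-batch work that should carry these fields ...
} finally {
  ThreadContext.removeAll()               // avoid leaking fields onto reused threads
}
```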

Re: [PR] [SPARK-48752][PYTHON][CONNECT][DOCS] Introduce `pyspark.logger` for improved structured logging for PySpark [spark]

2024-07-16 Thread via GitHub
itholic commented on code in PR #47145: URL: https://github.com/apache/spark/pull/47145#discussion_r1678866930 ## python/docs/source/development/logger.rst: ## @@ -0,0 +1,151 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agre

Re: [PR] [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on code in PR #43503: URL: https://github.com/apache/spark/pull/43503#discussion_r1678869244 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala: ## @@ -16,164 +16,201 @@ */ package org.apache.spark.sql.catalyst.xml

Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on code in PR #46873: URL: https://github.com/apache/spark/pull/46873#discussion_r1678871813 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -3072,15 +3072,14 @@ private[spark] object Utils */ lazy val isG1GC: Boolean = { Try {
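For background on what the PR simplifies: `isG1GC` is a lazy, best-effort check of the running JVM's collector. A minimal sketch of one common detection approach (inspecting the garbage collector MX bean names); this is an assumption for illustration, not necessarily the exact implementation under review:
```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._
import scala.util.Try

object GcUtils {
  // Best-effort G1 detection: on HotSpot the registered GC MX beans are named
  // "G1 Young Generation" / "G1 Old Generation" when -XX:+UseG1GC is in effect.
  lazy val isG1GC: Boolean = Try {
    ManagementFactory.getGarbageCollectorMXBeans.asScala.exists(_.getName.contains("G1"))
  }.getOrElse(false)
}
```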

Re: [PR] [SPARK-48873][SQL] Use UnsafeRow in JSON parser. [spark]

2024-07-16 Thread via GitHub
chenhao-db commented on code in PR #47310: URL: https://github.com/apache/spark/pull/47310#discussion_r1678874248 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4367,6 +4367,14 @@ object SQLConf { .booleanConf .createWithDefault(

Re: [PR] [SPARK-48873][SQL] Use UnsafeRow in JSON parser. [spark]

2024-07-16 Thread via GitHub
chenhao-db commented on PR #47310: URL: https://github.com/apache/spark/pull/47310#issuecomment-2230185352 @HyukjinKwon thanks! could you help merge it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #46873: URL: https://github.com/apache/spark/pull/46873#issuecomment-2230191426 BTW, I saw the first attempt here and the comment. - #46783 IIUC, he said he is okay with the old code. > I'd say I'm okay with the old reflection-based version for its

Re: [PR] [MINOR][TESTS] Remove unused test jar (udf_noA.jar) [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47309: URL: https://github.com/apache/spark/pull/47309#issuecomment-2230213227 😄 Ya, this is the 7th instance. - https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aclosed+is%3Amerged Given that the last one was last year, this is an annual event.

Re: [PR] [SPARK-47307][SQL][3.5] Add a config to optionally chunk base64 strings [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47325: URL: https://github.com/apache/spark/pull/47325#issuecomment-2230214873 Thank you, @wForget and @yaooqinn ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
LuciferYang commented on code in PR #46873: URL: https://github.com/apache/spark/pull/46873#discussion_r1678908450 ## core/src/main/scala/org/apache/spark/util/Utils.scala: ## @@ -3072,15 +3072,14 @@ private[spark] object Utils */ lazy val isG1GC: Boolean = { Try { +

Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
LuciferYang closed pull request #46873: [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` URL: https://github.com/apache/spark/pull/46873 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] [MINOR][TESTS] Remove unused test jar (udf_noA.jar) [spark]

2024-07-16 Thread via GitHub
grundprinzip commented on PR #47309: URL: https://github.com/apache/spark/pull/47309#issuecomment-2230264319 Conceptually yes, but I wanted to spend some time understanding the delta between the merge button and the script and how GH might have changed in between. One idea would be to use a

Re: [PR] [MINOR][TESTS] Remove unused test jar (udf_noA.jar) [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on PR #47309: URL: https://github.com/apache/spark/pull/47309#issuecomment-2230287146 Yeah, I think we should block the merge button -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on code in PR #43503: URL: https://github.com/apache/spark/pull/43503#discussion_r1678954714 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala: ## @@ -16,164 +16,201 @@ */ package org.apache.spark.sql.catalyst.xml

Re: [PR] [SPARK-48873][SQL] Use UnsafeRow in JSON parser. [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on PR #47310: URL: https://github.com/apache/spark/pull/47310#issuecomment-2230290146 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] [SPARK-48873][SQL] Use UnsafeRow in JSON parser. [spark]

2024-07-16 Thread via GitHub
HyukjinKwon closed pull request #47310: [SPARK-48873][SQL] Use UnsafeRow in JSON parser. URL: https://github.com/apache/spark/pull/47310 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] [SPARK-48873][SQL] Use UnsafeRow in JSON parser. [spark]

2024-07-16 Thread via GitHub
LuciferYang commented on PR #47310: URL: https://github.com/apache/spark/pull/47310#issuecomment-2230296253 Sorry, forgot to review this one. Late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

Re: [PR] [SPARK-48865][SQL] Add try_url_decode function [spark]

2024-07-16 Thread via GitHub
yaooqinn commented on code in PR #47294: URL: https://github.com/apache/spark/pull/47294#discussion_r1678963389 ## sql/core/src/test/resources/sql-tests/inputs/url-functions.sql: ## @@ -17,4 +17,10 @@ select url_encode(null); select url_decode('https%3A%2F%2Fspark.apache.org');
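A sketch of the expected `try_` semantics (return NULL instead of raising on malformed input), assuming an active `SparkSession` named `spark` (e.g. in spark-shell) and that the function lands under the name in the PR title:
```scala
// url_decode raises an error on a malformed percent-encoding; the try_ variant
// proposed in SPARK-48865 is expected to return NULL instead.
spark.sql("SELECT url_decode('https%3A%2F%2Fspark.apache.org')").show()
spark.sql("SELECT try_url_decode('https%3A%2F%2Fspark.apache.org')").show()
spark.sql("SELECT try_url_decode('http%3A%2F%2spark.apache.org')").show()  // expected: NULL
```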

Re: [PR] [DO-NOT-MERGE][SQL] Reduce the number of shuffles in SQL in local mode [spark]

2024-07-16 Thread via GitHub
pan3793 commented on code in PR #47349: URL: https://github.com/apache/spark/pull/47349#discussion_r1678971316 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -641,14 +641,19 @@ object SQLConf { .checkValue(_ > 0, "The value of spark.sql.le

[PR] [SPARK-48907][SQL] Fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT` [spark]

2024-07-16 Thread via GitHub
panbingkun opened a new pull request, #47365: URL: https://github.com/apache/spark/pull/47365 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change?

Re: [PR] [SPARK-48865][SQL] Add try_url_decode function [spark]

2024-07-16 Thread via GitHub
yaooqinn commented on PR #47294: URL: https://github.com/apache/spark/pull/47294#issuecomment-2230342454 Also cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47301: URL: https://github.com/apache/spark/pull/47301#discussion_r1679019564 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala: ## @@ -201,6 +201,22 @@ final class DataFrameWriter[T] private[sql] (ds: D

Re: [PR] [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47301: URL: https://github.com/apache/spark/pull/47301#discussion_r1679022463 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala: ## @@ -201,6 +201,22 @@ final class DataFrameWriter[T] private[sql] (ds: D

Re: [PR] [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47301: URL: https://github.com/apache/spark/pull/47301#discussion_r1679027390 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala: ## @@ -209,10 +209,25 @@ object ClusterBySpec { normalizeClusterBySpec(sc

Re: [PR] [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47301: URL: https://github.com/apache/spark/pull/47301#discussion_r1679031615 ## sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala: ## @@ -104,9 +106,27 @@ final class DataFrameWriterV2[T] private[sql](table: String, ds: Dat
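For readers following the review comments above, a sketch of how such a writer-level clustering API is typically used, e.g. in spark-shell; the `clusterBy` call is the shape proposed in SPARK-48761 and is an assumption here, not a released API:
```scala
import spark.implicits._

val df = Seq(("us", "2024-07-16", 1), ("eu", "2024-07-16", 2))
  .toDF("region", "event_date", "value")

// Declare clustering columns at write time instead of in SQL DDL.
df.write
  .clusterBy("region", "event_date")   // method shape under review in this PR
  .saveAsTable("events_clustered")
```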

Re: [PR] [SPARK-48907][SQL] Fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT` [spark]

2024-07-16 Thread via GitHub
panbingkun commented on PR #47365: URL: https://github.com/apache/spark/pull/47365#issuecomment-2230411842 cc @mihailom-db @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] [SPARK-48907][SQL] Fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT` [spark]

2024-07-16 Thread via GitHub
panbingkun commented on code in PR #47365: URL: https://github.com/apache/spark/pull/47365#discussion_r1679059383 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -3675,7 +3675,7 @@ private[sql] object QueryCompilationErrors extends

Re: [PR] [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on PR #47347: URL: https://github.com/apache/spark/pull/47347#issuecomment-2230473641 Sure, I will separate the PR. Thanks for reviewing this closely 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[PR] [SPARK-48909][ML][MLlib] Uses SparkSession over SparkContext when writing metadata [spark]

2024-07-16 Thread via GitHub
HyukjinKwon opened a new pull request, #47366: URL: https://github.com/apache/spark/pull/47366 ### What changes were proposed in this pull request? This PR proposes to use SparkSession over SparkContext when writing metadata ### Why are the changes needed? See https://git
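A minimal sketch of the contrast named in the title, e.g. in spark-shell, using placeholder paths and a fabricated metadata string; the actual ML writer helpers are not reproduced here:
```scala
val metadataJson = """{"class":"ExampleModel","sparkVersion":"4.0.0"}"""  // placeholder

// SparkContext-based (older style): single-partition RDD written as text.
spark.sparkContext.parallelize(Seq(metadataJson), numSlices = 1)
  .saveAsTextFile("/tmp/example_metadata_rdd")

// SparkSession-based: the same output produced through the DataFrame writer.
import spark.implicits._
Seq(metadataJson).toDF("value").repartition(1)
  .write.text("/tmp/example_metadata_ds")
```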

Re: [PR] [SPARK-48763][CONNECT][BUILD] Move connect server and common to builtin module [spark]

2024-07-16 Thread via GitHub
bjornjorgensen commented on PR #47157: URL: https://github.com/apache/spark/pull/47157#issuecomment-2230502432 oh.. my fault there was something else that did not work as intended on my build system. @HyukjinKwon thanks for double checking. -- This is an automated message from the Apa

[PR] [SPARK-48910] Use HashSet to avoid linear searches in PreprocessTableCreation [spark]

2024-07-16 Thread via GitHub
vladimirg-db opened a new pull request, #47367: URL: https://github.com/apache/spark/pull/47367 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### Ho
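The PR description is still empty at this point, but the title names a common optimization; a generic Scala sketch of the idea (names are illustrative and not taken from PreprocessTableCreation):
```scala
import scala.collection.immutable.HashSet

val partitionColumns: Seq[String] = Seq("year", "month", "day")
val schemaFields: Seq[String] = (1 to 10000).map(i => s"col_$i") ++ partitionColumns

// Before: each lookup scans the Seq, O(n) per field.
val viaSeq = schemaFields.filter(partitionColumns.contains)

// After: build the set once, then each lookup is effectively O(1).
val partitionSet = HashSet(partitionColumns: _*)
val viaSet = schemaFields.filter(partitionSet.contains)
```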

Re: [PR] [DO-NOT-MERGE][SQL] Reduce the number of shuffles in SQL in local mode [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on code in PR #47349: URL: https://github.com/apache/spark/pull/47349#discussion_r1679158703 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -641,14 +641,19 @@ object SQLConf { .checkValue(_ > 0, "The value of spark.sq

Re: [PR] [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata [spark]

2024-07-16 Thread via GitHub
HyukjinKwon commented on PR #47347: URL: https://github.com/apache/spark/pull/47347#issuecomment-2230562169 https://github.com/apache/spark/pull/47366 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[PR] [SPARK-48510] Fix for UDAF `toColumn` API when running tests in Maven [spark]

2024-07-16 Thread via GitHub
xupefei opened a new pull request, #47368: URL: https://github.com/apache/spark/pull/47368 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was

[PR] [ONLY TEST][SQL] Improve TPCDSCollationQueryTestSuite [spark]

2024-07-16 Thread via GitHub
panbingkun opened a new pull request, #47369: URL: https://github.com/apache/spark/pull/47369 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[PR] [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into resolveDataSource function [spark]

2024-07-16 Thread via GitHub
logze opened a new pull request, #47370: URL: https://github.com/apache/spark/pull/47370 ### What changes were proposed in this pull request? When reading csv, json and other files, pass the options parameter to the rules.resolveDataSource method to make the options parameter

Re: [PR] [SC-170296] GROUP BY with MapType nested inside complex type [spark]

2024-07-16 Thread via GitHub
nebojsa-db commented on code in PR #47331: URL: https://github.com/apache/spark/pull/47331#discussion_r1679332626 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -892,132 +892,108 @@ case class MapFromEntries(child: Expr

[PR] [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function [spark]

2024-07-16 Thread via GitHub
wForget opened a new pull request, #47371: URL: https://github.com/apache/spark/pull/47371 ### What changes were proposed in this pull request? Follow up to #47303 Add a migration guide for the behavior change of `base64` function ### Why are the changes neede
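For readers who want to see the behavior difference being documented: MIME-style base64 wraps its output at 76 characters with CRLF, while plain base64 emits a single line. A plain-JDK Scala sketch; Spark's exact config name is in the linked PRs and is not repeated here:
```scala
import java.util.Base64

val payload = Array.fill[Byte](100)(1)

val chunked   = Base64.getMimeEncoder.encodeToString(payload)  // RFC 2045: "\r\n" every 76 chars
val unchunked = Base64.getEncoder.encodeToString(payload)      // single line, no line breaks

println(chunked.contains("\r\n"))    // true
println(unchunked.contains("\r\n"))  // false
```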

Re: [PR] [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function [spark]

2024-07-16 Thread via GitHub
wForget commented on PR #47371: URL: https://github.com/apache/spark/pull/47371#issuecomment-2230858686 cc @yaooqinn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To uns

Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-16 Thread via GitHub
Ngone51 commented on code in PR #47192: URL: https://github.com/apache/spark/pull/47192#discussion_r1679431752 ## core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java: ## @@ -202,6 +226,18 @@ public long acquireExecutionMemory(long required, MemoryConsumer requesti
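A generic sketch of the bookkeeping a peak-memory metric needs (track a running maximum as memory is acquired); this is illustrative only and not the TaskMemoryManager code under review:
```scala
import java.util.concurrent.atomic.AtomicLong

class PeakTracker {
  private val peakBytes = new AtomicLong(0L)

  // Record current usage after each acquisition; keep the maximum ever seen.
  def record(currentUsedBytes: Long): Unit =
    peakBytes.getAndUpdate(prev => math.max(prev, currentUsedBytes))

  def peak: Long = peakBytes.get()
}
```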

Re: [PR] [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function [spark]

2024-07-16 Thread via GitHub
pan3793 commented on code in PR #47371: URL: https://github.com/apache/spark/pull/47371#discussion_r1679440173 ## docs/sql-migration-guide.md: ## @@ -64,6 +64,7 @@ license: | ## Upgrading from Spark SQL 3.5.1 to 3.5.2 - Since 3.5.2, MySQL JDBC datasource will read TINYINT UN

Re: [PR] [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL [spark]

2024-07-16 Thread via GitHub
sandip-db commented on code in PR #43503: URL: https://github.com/apache/spark/pull/43503#discussion_r1679453180 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala: ## @@ -16,164 +16,201 @@ */ package org.apache.spark.sql.catalyst.xml +i

Re: [PR] [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47272: URL: https://github.com/apache/spark/pull/47272#discussion_r1679508361 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -61,11 +61,18 @@ compoundBody compoundStatement : statement +| com

Re: [PR] [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts [spark]

2024-07-16 Thread via GitHub
cloud-fan commented on code in PR #47272: URL: https://github.com/apache/spark/pull/47272#discussion_r1679511481 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -251,26 +258,29 @@ statement | (MSCK)? REPAIR TABLE identifierReference

Re: [PR] [SPARK-48861][SQL] Enable shuffle file removal/skipMigration for all SQL executions [spark]

2024-07-16 Thread via GitHub
abellina commented on PR #47360: URL: https://github.com/apache/spark/pull/47360#issuecomment-2231084434 @bozhang2820 @cloud-fan fyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

Re: [PR] [SPARK-48791][CORE][3.5] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-16 Thread via GitHub
LuciferYang commented on PR #47297: URL: https://github.com/apache/spark/pull/47297#issuecomment-2231142709 > `Scala 2.13 build with SBT` failed: > > ``` > [error] /home/runner/work/spark/spark/mllib-local/src/main/scala/org/apache/spark/ml/stat/distribution/MultivariateGaussian.sc

Re: [PR] [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function [spark]

2024-07-16 Thread via GitHub
wForget commented on code in PR #47371: URL: https://github.com/apache/spark/pull/47371#discussion_r1679591679 ## docs/sql-migration-guide.md: ## @@ -64,6 +64,7 @@ license: | ## Upgrading from Spark SQL 3.5.1 to 3.5.2 - Since 3.5.2, MySQL JDBC datasource will read TINYINT UN

Re: [PR] [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts [spark]

2024-07-16 Thread via GitHub
davidm-db commented on code in PR #47272: URL: https://github.com/apache/spark/pull/47272#discussion_r1679605344 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -61,11 +61,18 @@ compoundBody compoundStatement : statement +| com

Re: [PR] [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts [spark]

2024-07-16 Thread via GitHub
davidm-db commented on code in PR #47272: URL: https://github.com/apache/spark/pull/47272#discussion_r1679616760 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -251,26 +258,29 @@ statement | (MSCK)? REPAIR TABLE identifierReference

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
sahnib commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1679584589 ## python/pyspark/worker.py: ## @@ -1609,6 +1645,35 @@ def mapper(a): vals = [a[o] for o in parsed_offsets[0][1]] return f(keys, vals) +e

Re: [PR] [SPARK-48510] Fix for UDAF `toColumn` API when running tests in Maven [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47368: URL: https://github.com/apache/spark/pull/47368#issuecomment-223129 Thank you, @xupefei . cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

Re: [PR] [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47347: URL: https://github.com/apache/spark/pull/47347#issuecomment-2231293170 Thank you, @HyukjinKwon and all. Merged to master for Apache Spark 4.0.0-preview2. -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun closed pull request #47347: [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata URL: https://github.com/apache/spark/pull/47347 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL ab

Re: [PR] [SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when writing metadata [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun closed pull request #47366: [SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when writing metadata URL: https://github.com/apache/spark/pull/47366 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

[PR] [WIP][SPARK-48911] Improve collation support testing for various expressions [spark]

2024-07-16 Thread via GitHub
uros-db opened a new pull request, #47372: URL: https://github.com/apache/spark/pull/47372 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
ericm-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1679749407 ## python/pyspark/sql/pandas/serializers.py: ## @@ -1116,3 +1121,88 @@ def init_stream_yield_batches(batches): batches_to_write = init_stream_yield_batches(se

Re: [PR] [SPARK-48382]Add controller / reconciler module to operator [spark-kubernetes-operator]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #12: URL: https://github.com/apache/spark-kubernetes-operator/pull/12#issuecomment-2231421900 Thank you. Did you finish the updates, @jiangzho ? It seems that there are some un-addressed comments. -- This is an automated message from the Apache Git Service.

Re: [PR] [SPARK-48900] Add `reason` field for `cancelJobGroup` and `cancelJobsWithTag` [spark]

2024-07-16 Thread via GitHub
mingkangli-db commented on code in PR #47361: URL: https://github.com/apache/spark/pull/47361#discussion_r1679822519 ## core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerEvent.scala: ## @@ -65,10 +65,13 @@ private[scheduler] case class JobCancelled( private[scheduler
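For context, the existing cancellation API plus the shape of the addition the PR title describes, assuming an active `SparkContext` named `sc`; the `reason` overload is the proposal under review and may differ in its final form:
```scala
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)

// Existing API: cancel every job in the group.
sc.cancelJobGroup("nightly-etl")

// Proposed in SPARK-48900 (signature assumed from the PR title):
// sc.cancelJobGroup("nightly-etl", reason = "superseded by a newer run")
```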

Re: [PR] [SPARK-48382]Add controller / reconciler module to operator [spark-kubernetes-operator]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #12: URL: https://github.com/apache/spark-kubernetes-operator/pull/12#issuecomment-2231493319 - I reviewed the previous comments and resolved when it's addressed. So, please go through the remaining open comments. We need to address them. - BTW, please re-co

[PR] add possibility to set log filename & disable spark log rotation [spark]

2024-07-16 Thread via GitHub
Tocard opened a new pull request, #47373: URL: https://github.com/apache/spark/pull/47373 As a Spark cluster administrator I want to manage application logs on my own. spark-daemon.sh has two main issues with it. - mandatory log file rotation and no way to disable it - har

Re: [PR] [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #46873: URL: https://github.com/apache/spark/pull/46873#issuecomment-2231557976 Thank you so much, @LuciferYang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on code in PR #43503: URL: https://github.com/apache/spark/pull/43503#discussion_r1679886747 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala: ## @@ -16,164 +16,201 @@ */ package org.apache.spark.sql.catalyst.xml

Re: [PR] [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on code in PR #43503: URL: https://github.com/apache/spark/pull/43503#discussion_r1679910072 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala: ## @@ -16,164 +16,201 @@ */ package org.apache.spark.sql.catalyst.xml

Re: [PR] [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into resolveDataSource function [spark]

2024-07-16 Thread via GitHub
szehon-ho commented on code in PR #47370: URL: https://github.com/apache/spark/pull/47370#discussion_r1679926575 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala: ## @@ -50,7 +52,11 @@ class ResolveSQLOnFile(sparkSession: SparkSession) extends R

[PR] [WIP] [SPARK-48900] Add `reason` field for all internal calls for job/stage cancellation [spark]

2024-07-16 Thread via GitHub
mingkangli-db opened a new pull request, #47374: URL: https://github.com/apache/spark/pull/47374 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### H

Re: [PR] Corrected row index usage when exploding packed arrays in vectorized reader [spark]

2024-07-16 Thread via GitHub
djspiewak commented on PR #46928: URL: https://github.com/apache/spark/pull/46928#issuecomment-2231694299 Is this being held up by anything? Any JIRA would be a fairly trivial transliteration of the test case that I added. Note the query and the example parquet file. That example does not w

Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-16 Thread via GitHub
liuzqt commented on code in PR #47192: URL: https://github.com/apache/spark/pull/47192#discussion_r1679983248 ## core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java: ## @@ -202,6 +226,18 @@ public long acquireExecutionMemory(long required, MemoryConsumer requestin

Re: [PR] [SPARK-48791][CORE][3.4] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47348: URL: https://github.com/apache/spark/pull/47348#issuecomment-2231724205 This PR seems to be ready, but let's wait until we merge #47297 because we need to keep the backporting order `master` -> `branch-3.5` -> `branch-3.4` to prevent any future regressi

Re: [PR] [SPARK-48791][CORE][3.5] Fix perf regression caused by the accumulators registration overhead using CopyOnWriteArrayList [spark]

2024-07-16 Thread via GitHub
dongjoon-hyun commented on PR #47297: URL: https://github.com/apache/spark/pull/47297#issuecomment-2231726254 Given that `branch-3.4` PR works, could you re-trigger CI, @Ngone51 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHu

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1679998994 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -358,6 +362,141 @@ def applyInPandasWithState( ) return DataFrame(jdf, self.session) + +de

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680002466 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -358,6 +362,141 @@ def applyInPandasWithState( ) return DataFrame(jdf, self.session) + +de

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680012217 ## python/pyspark/sql/streaming/StateMessage_pb2.pyi: ## @@ -0,0 +1,139 @@ +from google.protobuf.internal import enum_type_wrapper as _enum_type_wrapper Review Co

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680012933 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/streaming/InternalTimeModes.scala: ## @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680013451 ## python/pyspark/worker.py: ## @@ -487,6 +489,20 @@ def wrapped(key_series, value_series): return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on PR #47133: URL: https://github.com/apache/spark/pull/47133#issuecomment-2231760411 @bogao007 - this PR is doing a lot. Could we please add at least a high-level description of what files are being added, how they are being used, and what features will work after this cha

Re: [PR] [SPARK-48700] [SQL] Mode expression for complex types (all collations) [spark]

2024-07-16 Thread via GitHub
GideonPotok commented on code in PR #47154: URL: https://github.com/apache/spark/pull/47154#discussion_r1680023732 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala: ## @@ -86,6 +71,78 @@ case class Mode( buffer } + private

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
bogao007 commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680030806 ## python/pyspark/sql/streaming/StateMessage_pb2.pyi: ## @@ -0,0 +1,139 @@ +from google.protobuf.internal import enum_type_wrapper as _enum_type_wrapper Review Commen

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680017992 ## python/pyspark/sql/streaming/__init__.py: ## @@ -19,3 +19,4 @@ from pyspark.sql.streaming.readwriter import DataStreamReader, DataStreamWriter # noqa: F401

[PR] [SPARK-48914][SQL] Add OFFSET operator as an option in the subquery generator [spark]

2024-07-16 Thread via GitHub
averyqi-db opened a new pull request, #47375: URL: https://github.com/apache/spark/pull/47375 ### What changes were proposed in this pull request? This adds the OFFSET operator to the subquery generator suite. ### Why are the changes needed? Complete the subquery generator fu
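An illustrative example of the kind of OFFSET-bearing subquery such a generator can emit, e.g. in spark-shell; table and column names are placeholders:
```scala
// OFFSET is supported in Spark SQL (since 3.4), shown here inside an IN subquery.
spark.sql(
  """SELECT *
    |FROM orders o
    |WHERE o.customer_id IN (
    |  SELECT customer_id FROM customers ORDER BY customer_id LIMIT 10 OFFSET 5
    |)""".stripMargin).show()
```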

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680039967 ## python/pyspark/sql/streaming/StateMessage_pb2.pyi: ## @@ -0,0 +1,139 @@ +from google.protobuf.internal import enum_type_wrapper as _enum_type_wrapper Review Co

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680041729 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala: ## @@ -201,7 +202,7 @@ object ExpressionEncoder { * object. Thu

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680042299 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala: ## @@ -161,6 +161,41 @@ case class FlatMapGroupsInPandasWi

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680043141 ## sql/core/pom.xml: ## @@ -240,11 +241,42 @@ bcpkix-jdk18on test + + com.github.jnr Review Comment: Could we do this dependency u

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
bogao007 commented on PR #47133: URL: https://github.com/apache/spark/pull/47133#issuecomment-2231795368 > @bogao007 - this PR is doing a lot. Could we please add at least a high-level description of what files are being added, how they are being used, and what features will work after this chang

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
bogao007 commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680045141 ## sql/core/pom.xml: ## @@ -240,11 +241,42 @@ bcpkix-jdk18on test + + com.github.jnr Review Comment: This is needed for unix domain so

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680045887 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala: ## @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache S

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680046203 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala: ## @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache S

Re: [PR] [WIP][SPARK-42204][CORE] Add option to disable redundant logging of TaskMetrics internal accumulators in event logs [spark]

2024-07-16 Thread via GitHub
rednaxelafx commented on PR #39763: URL: https://github.com/apache/spark/pull/39763#issuecomment-2231799756 This looks good to me as well (non-binding). Thanks a lot for reviving this improvement! Agreed that the historical Spark UI itself wouldn't be affected. The REST API will get some

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680046810 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala: ## @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache S

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680047393 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala: ## @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache S

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680047982 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala: ## @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache S

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680048469 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasWriter.scala: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Softwar

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680048865 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala: ## @@ -17,6 +17,7 @@ package org.apache.spark.sql.execution.st

Re: [PR] [SPARK-48903][SS] Set the RocksDB last snapshot version correctly on remote load [spark]

2024-07-16 Thread via GitHub
HeartSaVioR commented on code in PR #47363: URL: https://github.com/apache/spark/pull/47363#discussion_r1680048571 ## sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala: ## @@ -1663,9 +1670,8 @@ class RocksDBSuite extends AlsoTestWithChang

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680059077 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala: ## @@ -408,6 +408,9 @@ object StateStoreProvider { hadoopConf: C

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1680059417 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateInPandasSuite.scala: ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Founda

Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-16 Thread via GitHub
anishshri-db commented on PR #47133: URL: https://github.com/apache/spark/pull/47133#issuecomment-2231820006 @bogao007 - test failure seems related ? ``` [error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasEx
