Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]
mridulm commented on code in PR #47192: URL: https://github.com/apache/spark/pull/47192#discussion_r1690907811

## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable { * joins. The value of this accumulator should be approximately the sum of the peak sizes * across all such data structures created in this task. For SQL jobs, this only tracks all * unsafe operators and ExternalSort. + * This is not equal to peakOnHeapExecutionMemory + peakOffHeapExecutionMemory */ + // TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole + // execution memory. We'd better deprecate this once we have a replacement. def peakExecutionMemory: Long = _peakExecutionMemory.sum + /** + * Peak on heap execution memory as tracked by TaskMemoryManager. + */ + def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum + /** + * Peak off heap execution memory as tracked by TaskMemoryManager. + */ + def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum

Review Comment: Discuss: Is it required that `peakExecutionMemory` <= `peakOnHeapExecutionMemory + peakOffHeapExecutionMemory`? Any cases where this might get violated? I am trying to reason about the completeness of these metrics (given that we want to eventually deprecate the existing one). I expect the above to hold, but want to make sure I am not missing anything. +CC @JoshRosen
Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]
yaooqinn commented on code in PR #47480: URL: https://github.com/apache/spark/pull/47480#discussion_r1690912988

## sql/core/src/test/resources/sql-tests/inputs/sql-on-files.sql: ## @@ -1,19 +1,30 @@ +CREATE DATABASE IF NOT EXISTS sql_on_files; -- Parquet +CREATE TABLE sql_on_files.TEST_PARQUET USING PARQUET AS SELECT 1;

Review Comment: That makes two of us.
Re: [PR] [SPARK-48829][BUILD] Upgrade `RoaringBitmap` to 1.2.1 [spark]
LuciferYang commented on PR #47247: URL: https://github.com/apache/spark/pull/47247#issuecomment-2249601828 ready to go?
Re: [PR] [ONLY TEST][HOLD] Upgrade rocksdbjni to 9.4.0 [spark]
LuciferYang commented on PR #47207: URL: https://github.com/apache/spark/pull/47207#issuecomment-2249610870 Has there been any new progress on this one?
Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]
LuciferYang commented on PR #47455: URL: https://github.com/apache/spark/pull/47455#issuecomment-2249631043 Are there any conclusions from this test?
Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]
panbingkun commented on PR #47455: URL: https://github.com/apache/spark/pull/47455#issuecomment-2249724363

> Are there any conclusions from this test?

From the testing, it seems that there is no significant difference in performance. Regarding the version of `janino`, we seem to be lagging behind a lot, as follows: https://github.com/user-attachments/assets/108448fd-d42b-47ed-9774-42e71bd1cffd
Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db closed pull request #47424: [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation URL: https://github.com/apache/spark/pull/47424
Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]
LuciferYang commented on PR #47455: URL: https://github.com/apache/spark/pull/47455#issuecomment-2249732282 Since this component is very critical, if there is no noticeable performance enhancement or critical bug fix, I recommend maintaining the use of the current version. cc @cloud-fan @dongjoon-hyun
[PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db opened a new pull request, #47484: URL: https://github.com/apache/spark/pull/47484

### What changes were proposed in this pull request?
Use `HashSet`/`HashMap` instead of doing linear searches over the `Seq`. In case of 1000s of partitions this significantly improves the performance.

### Why are the changes needed?
To avoid the O(n*m) passes in the `PreprocessTableCreation`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

### Was this patch authored or co-authored using generative AI tooling?
No
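For illustration, a minimal Scala sketch of the lookup pattern this PR describes (the helper name is hypothetical, not the actual diff): index the query output by column name once, then resolve each partition column with a map lookup instead of a linear `find` over the `Seq`.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical helper showing the O(n*m) -> O(n+m) change described above.
def resolvePartitionAttrs(
    output: Seq[Attribute],
    partitionColumnNames: Seq[String]): Seq[Attribute] = {
  // Build the name -> attribute index in one pass over the output.
  val attrsByName = output.map(attr => attr.name -> attr).toMap
  // Each partition column is now a constant-time lookup instead of output.find(...).
  partitionColumnNames.map { partCol =>
    attrsByName.getOrElse(partCol,
      throw new NoSuchElementException(s"Partition column $partCol not found"))
  }
}
```

Note that a plain `Map` keyed by name ignores case sensitivity; the real rule would also need to respect the analyzer's resolver.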
Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db commented on PR #47424: URL: https://github.com/apache/spark/pull/47424#issuecomment-2249742061 Recreated my fork again... Also deleted apache-spark-ci-image https://github.com/apache/spark/pull/47484
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
cloud-fan commented on PR #47483: URL: https://github.com/apache/spark/pull/47483#issuecomment-2249767736 This fixes a regression caused by https://github.com/apache/spark/pull/46565/files#diff-fbc6da30b8372e4f9aeb35ccf0d39eb796715d192c7eaeab109376584de0790eR121, and makes Delta Lake pass all tests. I think we should include it in 3.5.2. cc @yaooqinn
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
yaooqinn commented on PR #47483: URL: https://github.com/apache/spark/pull/47483#issuecomment-2249780123 Do we need this for branch 3.4?
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
yaooqinn commented on PR #47483: URL: https://github.com/apache/spark/pull/47483#issuecomment-2249812967 Merged to branch-3.5, thank you all
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
yaooqinn closed pull request #47483: [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy URL: https://github.com/apache/spark/pull/47483
Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]
panbingkun commented on PR #47455: URL: https://github.com/apache/spark/pull/47455#issuecomment-2249830972 Actually, it has some bug fixes, such as this typical one: [spark compilation failed with ArrayIndexOutOfBoundsException](https://github.com/janino-compiler/janino/issues/208)
Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]
LuciferYang commented on PR #47455: URL: https://github.com/apache/spark/pull/47455#issuecomment-2249837072

> Actually, it has some bug fixes, such as typical ones: [Fixed issue `spark compilation failed with ArrayIndexOutOfBoundsException`](https://github.com/janino-compiler/janino/issues/208)

Can we add the corresponding test cases in the current PR?
Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db commented on code in PR #47484: URL: https://github.com/apache/spark/pull/47484#discussion_r1691116665

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala: ## @@ -248,10 +249,15 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical DDLUtils.checkTableColumns(tableDesc.copy(schema = analyzedQuery.schema)) val output = analyzedQuery.output -val partitionAttrs = normalizedTable.partitionColumnNames.map { partCol => - output.find(_.name == partCol).get

Review Comment: @HyukjinKwon do you know if the original code is correct? The attribute references here are compared by their names and not ids
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
cloud-fan commented on PR #47483: URL: https://github.com/apache/spark/pull/47483#issuecomment-2249894610 It's fine to skip 3.4 as https://github.com/apache/spark/pull/46565 was not merged to 3.4 either.
Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]
yaooqinn closed pull request #47480: [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test URL: https://github.com/apache/spark/pull/47480
Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]
yaooqinn commented on PR #47483: URL: https://github.com/apache/spark/pull/47483#issuecomment-2249906895 Thank you @cloud-fan
Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]
yaooqinn commented on PR #47480: URL: https://github.com/apache/spark/pull/47480#issuecomment-2249910142 Merged to master, thank you @cloud-fan
Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db commented on PR #47484: URL: https://github.com/apache/spark/pull/47484#issuecomment-2250017548 @HyukjinKwon hi. Finally managed to make the Actions in my fork work. Tests passed.
Re: [PR] [SPARK-48344][SQL] SQL API change to support execution of compound statements [spark]
miland-db commented on code in PR #47403: URL: https://github.com/apache/spark/pull/47403#discussion_r1691277611

## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreter.scala: ## @@ -73,11 +74,19 @@ case class SqlScriptingInterpreter() { .map(new SingleStatementExec(_, Origin(), isInternal = true)) .reverse new CompoundBodyExec( - body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables) + body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables, session) case sparkStatement: SingleStatement => new SingleStatementExec( sparkStatement.parsedPlan, sparkStatement.origin, - isInternal = false) + shouldCollectResult = true) } + + def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {

Review Comment: For now, because we don't know which statement is the last one (which would let us do only one `collect()` and use a noop sink for the rest), we decided to do collect for all statements. Either way, it is important to execute each statement as soon as we encounter it to be able to handle errors properly. [PR introducing handlers](https://github.com/apache/spark/pull/47423) is currently work in progress and will probably explain why we did things the way we did in this PR.
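A rough Scala sketch of the execution shape described in this comment (simplified and hypothetical; the real interpreter operates on parsed plans and exec nodes, not raw SQL strings): every statement is collected as soon as it is reached, so a failure surfaces at the statement that caused it.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Simplified, hypothetical stand-in for the interpreter's execute():
// run each statement eagerly with collect() and keep all results,
// since we do not know up front which statement is the last one.
def executeCompound(session: SparkSession, statements: Seq[String]): Iterator[Array[Row]] =
  statements.map(stmt => session.sql(stmt).collect()).iterator
```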
Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]
HyukjinKwon commented on PR #47476: URL: https://github.com/apache/spark/pull/47476#issuecomment-2250101451 Merged to master.
Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]
vladimirg-db commented on code in PR #47484: URL: https://github.com/apache/spark/pull/47484#discussion_r1691291589

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala: ## @@ -248,10 +249,15 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical DDLUtils.checkTableColumns(tableDesc.copy(schema = analyzedQuery.schema)) val output = analyzedQuery.output -val partitionAttrs = normalizedTable.partitionColumnNames.map { partCol => - output.find(_.name == partCol).get

Review Comment: Ok nvm, should be fine, since `partitionColumnNames` are just strings
Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]
HyukjinKwon closed pull request #47476: [MINOR][DOCS] Update doc `sql/README.md` URL: https://github.com/apache/spark/pull/47476
Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]
HeartSaVioR commented on code in PR #47445: URL: https://github.com/apache/spark/pull/47445#discussion_r1691375121

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -208,13 +208,16 @@ class IncrementalExecution( } val schemaValidationResult = statefulOp. validateAndMaybeEvolveStateSchema(hadoopConf, currentBatchId, stateSchemaVersion) +val stateSchemaPaths = schemaValidationResult.map(_.schemaPath) // write out the state schema paths to the metadata file statefulOp match { - case stateStoreWriter: StateStoreWriter => -val metadata = stateStoreWriter.operatorStateMetadata() -// TODO: [SPARK-48849] Populate metadata with stateSchemaPaths if metadata version is v2 -val metadataWriter = new OperatorStateMetadataWriter(new Path( - checkpointLocation, stateStoreWriter.getStateInfo.operatorId.toString), hadoopConf) + case ssw: StateStoreWriter => +val metadata = ssw.operatorStateMetadata(stateSchemaPaths)

Review Comment: I'm a bit concerned about the number of files we are going to write in the query lifecycle, but we can defer the discussion and decision to the time we handle purge.
Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]
HeartSaVioR commented on PR #47445: URL: https://github.com/apache/spark/pull/47445#issuecomment-2250228784 Thanks! Merging to master.
Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]
HeartSaVioR closed pull request #47445: [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator URL: https://github.com/apache/spark/pull/47445
[PR] [WIP][SPARK-49002][SQL] Consistently handle invalid location/path values for all database objects [spark]
yaooqinn opened a new pull request, #47485: URL: https://github.com/apache/spark/pull/47485

### What changes were proposed in this pull request?
We are now consistently handling invalid location/path values for all database objects in this pull request. Before this PR, we only checked for `null` and `""` for a small group of operations, such as `SetNamespaceLocation` and `CreateNamespace`. However, various other commands or queries involved with location did not undergo verification. Besides, we also didn't apply suitable error classes for other syntax errors like `null` and `""`. In this PR, we add a try-catch block to rethrow INVALID_LOCATION errors for `null`, `""` and all other invalid inputs. And all operations for databases, tables, partitions, raw paths are validated.

### Why are the changes needed?
For better and consistent path errors

### Does this PR introduce _any_ user-facing change?
Yes, INVALID_EMPTY_LOCATION -> INVALID_LOCATION.EMPTY, and IllegalArgumentException thrown by path parsing is replaced with INVALID_LOCATION.SYTAX error

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no
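A hedged Scala sketch of the validation idea described above (illustrative names only; the actual PR uses Spark's error-class framework rather than plain exceptions): treat null, empty, and unparsable location strings uniformly.

```scala
import java.net.{URI, URISyntaxException}

// Illustrative only: one code path that rejects null, "" and syntactically
// invalid locations with a single, consistent "invalid location" error,
// instead of letting NullPointerException / IllegalArgumentException leak through.
def validateLocation(location: String): URI = {
  if (location == null || location.isEmpty) {
    throw new IllegalArgumentException("Invalid location: the location string is empty")
  }
  try new URI(location) catch {
    case e: URISyntaxException =>
      throw new IllegalArgumentException(s"Invalid location syntax: $location", e)
  }
}
```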
Re: [PR] [SPARK-32086][YARN]Bug fix for RemoveBroadcast RPC failed after executor is shutdown [spark]
esnhysythh commented on PR #28921: URL: https://github.com/apache/spark/pull/28921#issuecomment-2250349381 This pull request is dangerous. It may cause deadlock problems (I have experimented with it).
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1691630792

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -519,6 +519,8 @@ object FunctionRegistry { expressionBuilder("mode", ModeBuilder), expression[HllSketchAgg]("hll_sketch_agg"), expression[HllUnionAgg]("hll_union_agg"), +expression[Product]("product"),

Review Comment: Ok, let me try something different.
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1691639698

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -519,6 +519,8 @@ object FunctionRegistry { expressionBuilder("mode", ModeBuilder), expression[HllSketchAgg]("hll_sketch_agg"), expression[HllUnionAgg]("hll_union_agg"), +expression[Product]("product"),

Review Comment: I will create a separate registry for these functions, and only use them in DataFrame resolution.
[PR] [SPARK-49003][SQL][COLLATION] Fix calculating hash value of collated strings [spark]
ilicmarkodb opened a new pull request, #47486: URL: https://github.com/apache/spark/pull/47486

### What changes were proposed in this pull request?
Fixed calculating the hash value of collated strings. Changed the hashing function to use a proper hash for collated strings.

### Why are the changes needed?
Because the hash function was calculating different hash values for equal collated strings.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New tests inside `HashExpressionsSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?
No
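To illustrate the invariant being fixed (a sketch using the JDK `Collator`, not Spark's `CollationFactory`): strings that compare equal under a collation must hash to the same value, which is why the hash has to be computed from a collation key rather than the raw string bytes.

```scala
import java.text.Collator
import java.util.Locale

// Sketch of collation-aware hashing with the JDK Collator (illustrative only).
val collator = Collator.getInstance(Locale.ROOT)
collator.setStrength(Collator.SECONDARY) // ignore case differences

def collationAwareHash(s: String): Int =
  java.util.Arrays.hashCode(collator.getCollationKey(s).toByteArray)

// "spark" and "SPARK" are equal under this collation, so their hashes must match.
assert(collator.compare("spark", "SPARK") == 0)
assert(collationAwareHash("spark") == collationAwareHash("SPARK"))
```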
Re: [PR] [SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API [spark]
chirag-s-db commented on PR #47376: URL: https://github.com/apache/spark/pull/47376#issuecomment-2250755596 @HeartSaVioR https://github.com/apache/spark/pull/47301 has been merged, ready for review again!
[PR] Mima [spark]
xupefei opened a new pull request, #47487: URL: https://github.com/apache/spark/pull/47487

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]
amaliujia commented on code in PR #47476: URL: https://github.com/apache/spark/pull/47476#discussion_r1691718137

## sql/README.md: ## @@ -3,7 +3,8 @@ Spark SQL This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API. -Spark SQL is broken up into four subprojects: +Spark SQL is broken up into five subprojects: + - API (sql/api) - Includes some public API like DataType, Row, etc. This component can be shared between Catalyst and Spark Connect client.

Review Comment: Thanks for the documentation!
Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]
agubichev commented on PR #47388: URL: https://github.com/apache/spark/pull/47388#issuecomment-2250792171

> To extend it a bit more, shall we allow `where func(T_inner.x) = T_outer.date group by T_inner.x` if the `func` guarantees to produce different results for different values of `T_inner.x`?

We will look into this as follow-up steps, but ultimately this should be a runtime-level check (checking that the join returns at most 1 row), so that we can allow all kinds of subqueries and let the runtime throw an error for bad cases.
[PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]
dongjoon-hyun opened a new pull request, #47488: URL: https://github.com/apache/spark/pull/47488

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]
dongjoon-hyun opened a new pull request, #47489: URL: https://github.com/apache/spark/pull/47489

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]
ericm-db opened a new pull request, #47490: URL: https://github.com/apache/spark/pull/47490

### What changes were proposed in this pull request?
Currently, OperatorStateMetadataV2 and StateSchemaV3 files are written for every new query run. This PR will implement purging files so we only keep `minLogEntriesToMaintain` files per query.

### Why are the changes needed?
These changes are needed so that we don't indefinitely keep these files across many query runs, bounding the number of state files we keep

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests

### Was this patch authored or co-authored using generative AI tooling?
No
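A minimal Scala sketch of the retention scheme described above (the file layout and helper name are assumptions for illustration, not the actual implementation): keep the newest `minLogEntriesToMaintain` metadata files and delete the older ones.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative purge routine, assuming one metadata file per batch named by its batch id.
def purgeOldMetadataFiles(
    dir: Path,
    hadoopConf: Configuration,
    minLogEntriesToMaintain: Int): Unit = {
  val fs = dir.getFileSystem(hadoopConf)
  if (fs.exists(dir)) {
    val byBatchId = fs.listStatus(dir)
      .flatMap(st => scala.util.Try(st.getPath.getName.toLong).toOption.map(_ -> st.getPath))
      .sortBy(_._1)
    // Keep the newest N entries, delete everything older.
    byBatchId.dropRight(minLogEntriesToMaintain).foreach { case (_, path) =>
      fs.delete(path, false)
    }
  }
}
```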
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2250942959 cc @yaooqinn and @viirya
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2250943259 cc @yaooqinn and @viirya
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2250938296 cc @yaooqinn and @viirya
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2250955060 Also, cc @huaxingao
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2250955410 Also, cc @huaxingao
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2250968487 Thank you so much, @huaxingao !
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2250972235 Thank you, @huaxingao !
Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]
gene-db commented on code in PR #47473: URL: https://github.com/apache/spark/pull/47473#discussion_r1691812112

## common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java: ## @@ -120,6 +120,12 @@ public class VariantUtil { // Long string value. The content is (4-byte little-endian unsigned integer representing the // string size) + (size bytes of string content). public static final int LONG_STR = 16; + // year-month interval value. The content is one byte representing the start and end field values + // (1 bit each starting at least significant bits) and a 4-byte little-endian signed integer + public static final int YEAR_MONTH_INTERVAL = 19; + // day-time interval value. The content is one byte representing the start and end field values + // (2 bits each starting at least significant bits) and an 8-byte little-endian signed integer

Review Comment: Similar to the README, we should document what 0, 1, 2, 3 represent for the start and end fields.

## common/variant/src/main/java/org/apache/spark/types/variant/Variant.java: ## @@ -88,6 +91,16 @@ public long getLong() { return VariantUtil.getLong(value, pos); } + // Get the start and end fields of a year-month interval from the variant. + public IntervalFields getYearMonthIntervalFields() {

Review Comment: What uses these apis?

## common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java: ## @@ -377,11 +405,52 @@ public static long getLong(byte[] value, int pos) { case TIMESTAMP: case TIMESTAMP_NTZ: return readLong(value, pos + 1, 8); + case YEAR_MONTH_INTERVAL: +return readLong(value, pos + 2, 4); + case DAY_TIME_INTERVAL: +return readLong(value, pos + 2, 8); default: throw new IllegalStateException(exceptionMessage); } } + // Class used to pass around start and end fields of year-month and day-time interval values. + public static class IntervalFields { +public IntervalFields(byte startField, byte endField) { + this.startField = startField; + this.endField = endField; +} + +public final byte startField; +public final byte endField; + } + + // Get the start and end fields of a variant value representing a year-month interval value. The + // returned array contains the start field at the zeroth index and the end field at the first + // index. + public static IntervalFields getYearMonthIntervalFields(byte[] value, int pos) { +long fieldInfo = readLong(value, pos + 1, 1); +IntervalFields intervalFields = new IntervalFields((byte) (fieldInfo & 0x1), +(byte) ((fieldInfo >> 1) & 0x1)); +if (intervalFields.endField < intervalFields.startField) {

Review Comment: Do we need to check the type to be year-month interval, and the length? Some other functions call `checkIndex`.

## common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java: ## @@ -355,9 +380,12 @@ public static boolean getBoolean(byte[] value, int pos) { // Get a long value from variant value `value[pos...]`. // It is only legal to call it if `getType` returns one of `Type.LONG/DATE/TIMESTAMP/ - // TIMESTAMP_NTZ`. If the type is `DATE`, the return value is guaranteed to fit into an int and - // represents the number of days from the Unix epoch. If the type is `TIMESTAMP/TIMESTAMP_NTZ`, - // the return value represents the number of microseconds from the Unix epoch. + // TIMESTAMP_NTZ/YEAR_MONTH_INTERVAL/DAY_TIME_INTERVAL`. If the type is `DATE`, the return value + // is guaranteed to fit into an int and represents the number of days from the Unix epoch. + // If the type is `TIMESTAMP/TIMESTAMP_NTZ`, the return value represents the number of + // microseconds from the Unix epoch. If the type is `YEAR_MONTH_INTERVAL`, the return value

Review Comment: We should also mention that the year-month one is guaranteed to fit into an int.

## common/variant/src/main/java/org/apache/spark/types/variant/Variant.java: ## @@ -113,6 +126,11 @@ public String getString() { return VariantUtil.getString(value, pos); } + // Get the type info bits from a variant value. + public int getTypeInfo() {

Review Comment: I can't easily tell, but what uses this api?

## common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java: ## @@ -120,6 +120,12 @@ public class VariantUtil { // Long string value. The content is (4-byte little-endian unsigned integer representing the // string size) + (size bytes of string content). public static final int LONG_STR = 16; + // year-month interval value. The content is one byte representing the start and end field values + // (1 bit each starting at least significant bits) and a 4-byte little-endian signed integer

Review Comment: Similar to the README, we should document what 0, 1, 2, 3 represent for the start and end fields.
Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]
harshmotw-db commented on code in PR #47473: URL: https://github.com/apache/spark/pull/47473#discussion_r1691852875

## common/variant/src/main/java/org/apache/spark/types/variant/Variant.java: ## @@ -113,6 +126,11 @@ public String getString() { return VariantUtil.getString(value, pos); } + // Get the type info bits from a variant value. + public int getTypeInfo() {

Review Comment: It is used when throwing an `UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT` exception from scopes which don't directly have access to the type info.
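As a standalone illustration of the field-info byte discussed in this thread (written in Scala, independent of the real `VariantUtil` API, but consistent with the decoding quoted above): for a year-month interval the start and end fields occupy one bit each, starting at the least significant bit.

```scala
// Bit layout (year-month interval): bit 0 = start field, bit 1 = end field.
def packYearMonthFields(startField: Int, endField: Int): Byte =
  ((startField & 0x1) | ((endField & 0x1) << 1)).toByte

def unpackYearMonthFields(fieldInfo: Byte): (Int, Int) =
  (fieldInfo & 0x1, (fieldInfo >> 1) & 0x1)

// Example: an interval spanning YEAR (encoded as 0) to MONTH (encoded as 1).
val fields = unpackYearMonthFields(packYearMonthFields(0, 1))
assert(fields._1 == 0 && fields._2 == 1)
```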
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2251060362 Let me merge this to recover the CIs.
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun closed pull request #47488: [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 URL: https://github.com/apache/spark/pull/47488
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2251061801 Let me merge this to recover the CIs.
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun closed pull request #47489: [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 URL: https://github.com/apache/spark/pull/47489
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
dongjoon-hyun commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2251069589 Thank you, @viirya !
Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]
harshmotw-db commented on code in PR #47473: URL: https://github.com/apache/spark/pull/47473#discussion_r1691939585

## common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java: ## @@ -377,11 +405,52 @@ public static long getLong(byte[] value, int pos) { case TIMESTAMP: case TIMESTAMP_NTZ: return readLong(value, pos + 1, 8); + case YEAR_MONTH_INTERVAL: +return readLong(value, pos + 2, 4); + case DAY_TIME_INTERVAL: +return readLong(value, pos + 2, 8); default: throw new IllegalStateException(exceptionMessage); } } + // Class used to pass around start and end fields of year-month and day-time interval values. + public static class IntervalFields { +public IntervalFields(byte startField, byte endField) { + this.startField = startField; + this.endField = endField; +} + +public final byte startField; +public final byte endField; + } + + // Get the start and end fields of a variant value representing a year-month interval value. The + // returned array contains the start field at the zeroth index and the end field at the first + // index. + public static IntervalFields getYearMonthIntervalFields(byte[] value, int pos) { +long fieldInfo = readLong(value, pos + 1, 1); +IntervalFields intervalFields = new IntervalFields((byte) (fieldInfo & 0x1), +(byte) ((fieldInfo >> 1) & 0x1)); +if (intervalFields.endField < intervalFields.startField) {

Review Comment: Yes, that would have been a good idea. `checkIndex` is performed by `readLong` anyway though. Done.
Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]
harshmotw-db commented on code in PR #47473: URL: https://github.com/apache/spark/pull/47473#discussion_r1691942388

## common/variant/src/main/java/org/apache/spark/types/variant/Variant.java: ## @@ -88,6 +91,16 @@ public long getLong() { return VariantUtil.getLong(value, pos); } + // Get the start and end fields of a year-month interval from the variant. + public IntervalFields getYearMonthIntervalFields() {

Review Comment: `schemaOf`, `cast from variant` and `to_json`
Re: [PR] [wip]Metadata vcf [spark]
ericm-db closed pull request #47446: [wip]Metadata vcf URL: https://github.com/apache/spark/pull/47446
Re: [PR] [SPARK-48821][SQL] Support Update in DataFrameWriterV2 [spark]
szehon-ho commented on PR #47233: URL: https://github.com/apache/spark/pull/47233#issuecomment-2251363935

Changed the API to be:
```
spark.table(tableNameAsString)
  .update(Map("salary" -> lit(-1)))
  .where($"pk" >= 2)
  .execute()
```
Can add option in a follow-up PR.
Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]
ueshin commented on PR #47474: URL: https://github.com/apache/spark/pull/47474#issuecomment-2251366183 The failure seems unrelated to this PR.
Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]
ueshin commented on PR #47474: URL: https://github.com/apache/spark/pull/47474#issuecomment-2251366450 Thanks! merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]
ueshin closed pull request #47474: [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column URL: https://github.com/apache/spark/pull/47474 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]
dongjoon-hyun opened a new pull request, #47491: URL: https://github.com/apache/spark/pull/47491 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]
liuzqt commented on code in PR #47192: URL: https://github.com/apache/spark/pull/47192#discussion_r1692144786 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable { * joins. The value of this accumulator should be approximately the sum of the peak sizes * across all such data structures created in this task. For SQL jobs, this only tracks all * unsafe operators and ExternalSort. + * This is not equal to peakOnHeapExecutionMemory + peakOffHeapExecutionMemory */ + // TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole + // execution memory. We'd better deprecate this once we have a replacement. def peakExecutionMemory: Long = _peakExecutionMemory.sum + /** + * Peak on heap execution memory as tracked by TaskMemoryManager. + */ + def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum + + /** + * Peak off heap execution memory as tracked by TaskMemoryManager. + */ + def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum Review Comment: peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory? I think yes, because `TaskMemoryManager.acquireExecutionMemory` is the only narrow waist for any execution memory acquisition, and we maintain the peak metrics there. In contrast, the legacy `peakExecutionMemory` is maintained by individual operators (join, agg, sort), so its coverage is entirely up to each operator's implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
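The "narrow waist" argument above is easiest to see with a toy example: when every acquisition and release goes through a single method, that method can maintain both the current and the peak value, so the peak can never undercount what was actually acquired. The Scala sketch below only illustrates that pattern; it is not the actual `TaskMemoryManager` code.

```scala
// Toy tracker showing why a metric maintained at a single choke point is complete:
// every acquire/release updates `current`, and the peak is folded in at the same place.
final class PeakTrackingMemoryPool {
  private var current: Long = 0L
  private var peak: Long = 0L

  def acquire(bytes: Long): Unit = synchronized {
    current += bytes
    peak = math.max(peak, current) // updated on the only path that can grow usage
  }

  def release(bytes: Long): Unit = synchronized {
    current = math.max(0L, current - bytes)
  }

  def peakBytes: Long = synchronized(peak)
}

object PeakTrackingMemoryPoolExample extends App {
  val pool = new PeakTrackingMemoryPool
  pool.acquire(64)
  pool.acquire(128)
  pool.release(64)
  pool.acquire(16)
  assert(pool.peakBytes == 192) // 64 + 128, the true high-water mark
}
```

A per-operator counter, by contrast, only sees what that operator chooses to report, which is why the sum of the two manager-tracked peaks is expected to bound the legacy metric from above.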
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
dongjoon-hyun commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2251424159 Thank you, @viirya ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]
dongjoon-hyun commented on PR #47491: URL: https://github.com/apache/spark/pull/47491#issuecomment-2251426017 Could you review this PR, @huaxingao ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]
liuzqt commented on PR #47192: URL: https://github.com/apache/spark/pull/47192#issuecomment-2251439038 > Take a look at `peakExecutionMemory` within spark-core. We should be exposing the new metrics as part of the api - both at task level, and at stage level (distributions for ex). We should definitely expose this in the API. But can we land this core change first, and then add it to the API/UI in follow-up PRs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Rebased value state [spark]
jingz-db opened a new pull request, #47492: URL: https://github.com/apache/spark/pull/47492 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-49008][PYTHON] Use `ParamSpec` to propagate `func` signature in `transform` [spark]
nicklamiller opened a new pull request, #47493: URL: https://github.com/apache/spark/pull/47493 ### What changes were proposed in this pull request? Propagate the function signature of `func` in `DataFrame(...).transform(...)`. ### Why are the changes needed? Propagating the function signature for `func` in `DataFrame(...).transform(...)` enables type checkers like `mypy` to enforce that `func` is being correctly called through `DataFrame(...).transform(...)`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran an example script that passes `add_num` to `transform` with inappropriate arguments: - first running mypy on `master`, showing that no mypy errors are raised - then running mypy with the changes in this PR on the `type-hint-transform-args-kwargs` branch - this shows that the expected mypy errors are raised screenshot of no mypy errors on master https://github.com/user-attachments/assets/a2bb01b2-8ca8-41d6-a50a-fe95d7a1cfd7 screenshot of expected mypy errors with PR changes https://github.com/user-attachments/assets/8d7920cf-e9ad-42a0-b435-1f55b0469f29 example script ```python from pyspark.sql import DataFrame, functions as F, SparkSession spark = ( SparkSession .builder .appName("Python Spark SQL basic example") .getOrCreate() ) df = spark.createDataFrame([("a", 0), ("b", 1)], schema=["col1", "col2"]) def add_num(df: DataFrame, in_colname: str, *, num: int) -> DataFrame: return df.withColumn("new_col", F.col(in_colname) + num) if __name__=="__main__": df.transform(add_num, "col2", 2).show() # enforces kw df.transform(add_num, in_colname="col2", num="a").show() # enforces type for kwarg df.transform(add_num, in_colname=2, num=2).show() # enforces type for arg df.transform(add_num, "col2").show() # enforces required args ``` ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]
JoshRosen commented on code in PR #47192: URL: https://github.com/apache/spark/pull/47192#discussion_r1692177886 ## core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala: ## @@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable { * joins. The value of this accumulator should be approximately the sum of the peak sizes * across all such data structures created in this task. For SQL jobs, this only tracks all * unsafe operators and ExternalSort. + * This is not equal to peakOnHeapExecutionMemory + peakOffHeapExecutionMemory */ + // TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole + // execution memory. We'd better deprecate this once we have a replacement. def peakExecutionMemory: Long = _peakExecutionMemory.sum + /** + * Peak on heap execution memory as tracked by TaskMemoryManager. + */ + def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum + + /** + * Peak off heap execution memory as tracked by TaskMemoryManager. + */ + def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum Review Comment: +1, I agree that the `peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory` should hold: If we trace through the existing callers of `incPeakExecutionMemory` it looks like all of the usages flow from counts that correspond to the acquireExecutionMemory waist. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]
dongjoon-hyun commented on PR #47491: URL: https://github.com/apache/spark/pull/47491#issuecomment-2251459571 Thank you, @huaxingao ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]
bogao007 commented on PR #47133: URL: https://github.com/apache/spark/pull/47133#issuecomment-2251495351 Hi @HyukjinKwon, I'm getting the test errors below with my PR: ``` == ERROR [0.255s]: test_termination_sigterm (pyspark.tests.test_daemon.DaemonTests.test_termination_sigterm) Ensure that daemon and workers terminate on SIGTERM. -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/tests/test_daemon.py", line 77, in test_termination_sigterm self.do_termination_test(lambda daemon: os.kill(daemon.pid, SIGTERM)) File "/__w/spark/spark/python/pyspark/tests/test_daemon.py", line 49, in do_termination_test port = read_int(daemon.stdout) ^^^ File "/__w/spark/spark/python/pyspark/serializers.py", line 597, in read_int raise EOFError EOFError ``` But if I revert the changes in `python/pyspark/sql/pandas/group_ops.py` and `python/pyspark/sql/streaming/__init__.py`, it succeeds. Do you know if this is related to my newly added file `stateful_processor.py`? Not sure why it would be related, though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-49010] Add unit tests for XML schema inference case sensitivity [spark]
shujingyang-db opened a new pull request, #47494: URL: https://github.com/apache/spark/pull/47494 ### What changes were proposed in this pull request? Currently, XML respects the case sensitivity SQLConf (which defaults to false) during schema inference, but we lack unit tests to verify the behavior. This PR adds more unit tests for it. ### Why are the changes needed? This is a test-only change. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is a test-only change. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-49009][SQL][PYTHON] Make Column APIs and functions accept Enums [spark]
ueshin opened a new pull request, #47495: URL: https://github.com/apache/spark/pull/47495 ### What changes were proposed in this pull request? Make Column APIs and functions accept `Enum`s. ### Why are the changes needed? `Enum`s can be accepted in Column APIs and functions by using their `value`. ```py >>> from pyspark.sql import functions as F >>> from enum import Enum >>> class A(Enum): ... X = "x" ... Y = "y" ... >>> F.lit(A.X) Column<'x'> >>> F.lit(A.X) + A.Y Column<'`+`(x, y)'> ``` ### Does this PR introduce _any_ user-facing change? Yes, Python's `Enum`s will be used as literal values. ### How was this patch tested? Added the related tests. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47829][SQL] Text Datasource supports Zstd compression codec [spark]
github-actions[bot] closed pull request #46026: [SPARK-47829][SQL] Text Datasource supports Zstd compression codec URL: https://github.com/apache/spark/pull/46026 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]
github-actions[bot] commented on PR #45446: URL: https://github.com/apache/spark/pull/45446#issuecomment-2251630213 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47279][CORE]When the messageLoop encounter a fatal exception, such as oom, exit the JVM to avoid the driver hanging forever [spark]
github-actions[bot] commented on PR #45385: URL: https://github.com/apache/spark/pull/45385#issuecomment-2251630277 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47835][NETWORK] Remove switch for remoteReadNioBufferConversion [spark]
github-actions[bot] closed pull request #46030: [SPARK-47835][NETWORK] Remove switch for remoteReadNioBufferConversion URL: https://github.com/apache/spark/pull/46030 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692311896 ## connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1893,33 +1855,6 @@ class SparkConnectPlanner( val unit = extractString(children(0), "unit") Some(TimestampAdd(unit, children(1), children(2))) - case "window" if Seq(2, 3, 4).contains(fun.getArgumentsCount) => -val children = fun.getArgumentsList.asScala.map(transformExpression) -val timeCol = children.head -val windowDuration = extractString(children(1), "windowDuration") -var slideDuration = windowDuration -if (fun.getArgumentsCount >= 3) { - slideDuration = extractString(children(2), "slideDuration") -} -var startTime = "0 second" -if (fun.getArgumentsCount == 4) { - startTime = extractString(children(3), "startTime") -} -Some( - Alias(TimeWindow(timeCol, windowDuration, slideDuration, startTime), "window")( -nonInheritableMetadataKeys = Seq(Dataset.DATASET_ID_KEY, Dataset.COL_POS_KEY))) - - case "session_window" if fun.getArgumentsCount == 2 => -val children = fun.getArgumentsList.asScala.map(transformExpression) -val timeCol = children.head -val sessionWindow = children.last match { - case Literal(s, StringType) if s != null => SessionWindow(timeCol, s.toString) - case other => SessionWindow(timeCol, other) -} -Some( - Alias(sessionWindow, "session_window")(nonInheritableMetadataKeys = Review Comment: Why do you explicitly remove the `Dataset.DATASET_ID_KEY` and `Dataset.COL_POS_KEY`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692314400 ## sql/core/src/main/scala/org/apache/spark/sql/functions.scala: ## @@ -5705,11 +5693,8 @@ object functions { * @group datetime_funcs * @since 3.2.0 */ - def session_window(timeColumn: Column, gapDuration: String): Column = { -withExpr { - SessionWindow(timeColumn.expr, gapDuration) -}.as("session_window") - } + def session_window(timeColumn: Column, gapDuration: String): Column = +session_window(timeColumn, lit(gapDuration)) Review Comment: This is now supported because of the small change I made to ResolveTimeWindows. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692315764 ## sql/core/src/main/scala/org/apache/spark/sql/functions.scala: ## @@ -5743,7 +5728,7 @@ object functions { * @since 3.2.0 */ def session_window(timeColumn: Column, gapDuration: Column): Column = -Column.fn("session_window", timeColumn, gapDuration).as("session_window") Review Comment: Alias is not needed. This is added by the ResolveTimeWindows rule. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692320832 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTimeWindows.scala: ## @@ -257,10 +259,11 @@ object SessionWindowing extends Rule[LogicalPlan] { case s: SessionWindow => sessionAttr } -val filterByTimeRange = session.gapDuration match { - case Literal(interval: CalendarInterval, CalendarIntervalType) => -interval == null || interval.months + interval.days + interval.microseconds <= 0 - case _ => true +val filterByTimeRange = if (gapDuration.foldable) { Review Comment: This is a bit better than pattern matching on a literal since it also supports more complex expressions, in particular `Cast(Literal("r", StringType), CalendarIntervalType)`. Normally you'd expect the constant folding to take care of this, but this happens during optimization, and this rule is part of analysis; hence the need for manual constant folding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
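The manual constant folding mentioned above can be sketched as follows: during analysis the optimizer's `ConstantFolding` rule has not run yet, so a foldable expression such as a string literal cast to `CalendarIntervalType` has to be evaluated by hand before its interval can be inspected. This is a hedged Scala sketch against the Catalyst APIs as I understand them, not the exact rule code.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression, Literal}
import org.apache.spark.sql.types.CalendarIntervalType
import org.apache.spark.unsafe.types.CalendarInterval

object GapDurationFoldingSketch {
  // Returns true only when the gap expression folds to a known, positive interval.
  // Unlike pattern matching on Literal, this also handles foldable-but-not-literal
  // expressions, e.g. a cast of a string literal to an interval.
  def gapIsKnownPositive(gapDuration: Expression): Boolean =
    if (gapDuration.foldable) {
      Cast(gapDuration, CalendarIntervalType).eval() match {
        case interval: CalendarInterval =>
          interval.months + interval.days + interval.microseconds > 0
        case _ => false // null or non-interval constant: positivity cannot be proven
      }
    } else {
      false // non-foldable gaps have to be validated at runtime instead
    }

  def main(args: Array[String]): Unit = {
    val gap: Expression = Cast(Literal("10 minutes"), CalendarIntervalType)
    assert(gapIsKnownPositive(gap)) // folds to a ten-minute CalendarInterval
  }
}
```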
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692311896 ## connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1893,33 +1855,6 @@ class SparkConnectPlanner( val unit = extractString(children(0), "unit") Some(TimestampAdd(unit, children(1), children(2))) - case "window" if Seq(2, 3, 4).contains(fun.getArgumentsCount) => -val children = fun.getArgumentsList.asScala.map(transformExpression) -val timeCol = children.head -val windowDuration = extractString(children(1), "windowDuration") -var slideDuration = windowDuration -if (fun.getArgumentsCount >= 3) { - slideDuration = extractString(children(2), "slideDuration") -} -var startTime = "0 second" -if (fun.getArgumentsCount == 4) { - startTime = extractString(children(3), "startTime") -} -Some( - Alias(TimeWindow(timeCol, windowDuration, slideDuration, startTime), "window")( -nonInheritableMetadataKeys = Seq(Dataset.DATASET_ID_KEY, Dataset.COL_POS_KEY))) - - case "session_window" if fun.getArgumentsCount == 2 => -val children = fun.getArgumentsList.asScala.map(transformExpression) -val timeCol = children.head -val sessionWindow = children.last match { - case Literal(s, StringType) if s != null => SessionWindow(timeCol, s.toString) - case other => SessionWindow(timeCol, other) -} -Some( - Alias(sessionWindow, "session_window")(nonInheritableMetadataKeys = Review Comment: @zhengruifeng Why do you explicitly remove the `Dataset.DATASET_ID_KEY` and `Dataset.COL_POS_KEY`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]
HyukjinKwon commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1692328985 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -358,6 +364,140 @@ def applyInPandasWithState( ) return DataFrame(jdf, self.session) +def transformWithStateInPandas( +self, +statefulProcessor: StatefulProcessor, +outputStructType: Union[StructType, str], +outputMode: str, +timeMode: str, +) -> DataFrame: +""" +Invokes methods defined in the stateful processor used in arbitrary state API v2. +We allow the user to act on per-group set of input rows along with keyed state and the +user can choose to output/return 0 or more rows. + +For a streaming dataframe, we will repeatedly invoke the interface methods for new rows +in each trigger and the user's state/state variables will be stored persistently across +invocations. + +The `statefulProcessor` should be a Python class that implements the interface defined in +pyspark.sql.streaming.stateful_processor.StatefulProcessor. + +The `outputStructType` should be a :class:`StructType` describing the schema of all +elements in the returned value, `pandas.DataFrame`. The column labels of all elements in +returned `pandas.DataFrame` must either match the field names in the defined schema if +specified as strings, or match the field data types by position if not strings, +e.g. integer indices. + +The size of each `pandas.DataFrame` in both the input and output can be arbitrary. The +number of `pandas.DataFrame` in both the input and output can also be arbitrary. + +.. versionadded:: 4.0.0 + +Parameters +-- +statefulProcessor : :class:`pyspark.sql.streaming.stateful_processor.StatefulProcessor` +Instance of StatefulProcessor whose functions will be invoked by the operator. +outputStructType : :class:`pyspark.sql.types.DataType` or str +The type of the output records. The value can be either a +:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string. +outputMode : str +The output mode of the stateful processor. +timeMode : str +The time mode semantics of the stateful processor for timers and TTL. + +Examples + +>>> import pandas as pd +>>> from pyspark.sql import Row +>>> from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle +>>> from pyspark.sql.types import StructType, StructField, LongType, StringType, IntegerType +>>> from typing import Iterator +>>> from pyspark.sql.functions import split, col +>>> spark.conf.set("spark.sql.streaming.stateStore.providerClass","org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider") +>>> output_schema = StructType([ +... StructField("id", StringType(), True), +... StructField("count", IntegerType(), True) +... ]) +>>> state_schema = StructType([ +... StructField("value", IntegerType(), True) +... ]) +>>> class SimpleStatefulProcessor(StatefulProcessor): +... def init(self, handle: StatefulProcessorHandle): Review Comment: 4 spaced indentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]
HyukjinKwon commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1692326848 ## python/pyspark/sql/streaming/stateful_processor.py: ## @@ -0,0 +1,180 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from abc import ABC, abstractmethod +from typing import Any, TYPE_CHECKING, Iterator, Union + +from pyspark.sql.streaming.stateful_processor_api_client import StatefulProcessorApiClient +from pyspark.sql.streaming.value_state_client import ValueStateClient + +import pandas as pd +from pyspark.sql.types import ( +StructType, +StructField, +IntegerType, +LongType, +ShortType, +FloatType, +DoubleType, +DecimalType, +StringType, +BooleanType, +DateType, +TimestampType, +) + +__all__ = ["StatefulProcessor", "StatefulProcessorHandle"] + +if TYPE_CHECKING: +from pyspark.sql.pandas._typing import DataFrameLike as PandasDataFrameLike + + +class ValueState: +""" +Class used for arbitrary stateful operations with the v2 API to capture single value state. +""" + +def __init__( +self, value_state_client: ValueStateClient, state_name: str, schema: Union[StructType, str] +) -> None: +self._value_state_client = value_state_client +self._state_name = state_name +self.schema = schema + +def exists(self) -> bool: +""" +Whether state exists or not. + +.. versionadded:: 4.0.0 Review Comment: Adding it at class level docstring should be enough -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]
HyukjinKwon commented on code in PR #47133: URL: https://github.com/apache/spark/pull/47133#discussion_r1692329891 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -358,6 +364,140 @@ def applyInPandasWithState( ) return DataFrame(jdf, self.session) +def transformWithStateInPandas( +self, +statefulProcessor: StatefulProcessor, +outputStructType: Union[StructType, str], +outputMode: str, +timeMode: str, +) -> DataFrame: +""" +Invokes methods defined in the stateful processor used in arbitrary state API v2. +We allow the user to act on per-group set of input rows along with keyed state and the +user can choose to output/return 0 or more rows. + +For a streaming dataframe, we will repeatedly invoke the interface methods for new rows +in each trigger and the user's state/state variables will be stored persistently across +invocations. + +The `statefulProcessor` should be a Python class that implements the interface defined in +pyspark.sql.streaming.stateful_processor.StatefulProcessor. + +The `outputStructType` should be a :class:`StructType` describing the schema of all +elements in the returned value, `pandas.DataFrame`. The column labels of all elements in +returned `pandas.DataFrame` must either match the field names in the defined schema if +specified as strings, or match the field data types by position if not strings, +e.g. integer indices. + +The size of each `pandas.DataFrame` in both the input and output can be arbitrary. The +number of `pandas.DataFrame` in both the input and output can also be arbitrary. + +.. versionadded:: 4.0.0 + +Parameters +-- +statefulProcessor : :class:`pyspark.sql.streaming.stateful_processor.StatefulProcessor` +Instance of StatefulProcessor whose functions will be invoked by the operator. +outputStructType : :class:`pyspark.sql.types.DataType` or str +The type of the output records. The value can be either a +:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string. +outputMode : str +The output mode of the stateful processor. +timeMode : str +The time mode semantics of the stateful processor for timers and TTL. + +Examples + +>>> import pandas as pd +>>> from pyspark.sql import Row +>>> from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle +>>> from pyspark.sql.types import StructType, StructField, LongType, StringType, IntegerType +>>> from typing import Iterator +>>> from pyspark.sql.functions import split, col Review Comment: let's fix the import according to PEP 8 import order -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48829][BUILD] Upgrade `RoaringBitmap` to 1.2.1 [spark]
panbingkun commented on PR #47247: URL: https://github.com/apache/spark/pull/47247#issuecomment-2251747973 > ready to go? Yea, ready for review, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692345753 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -767,6 +771,7 @@ object FunctionRegistry { expression[EqualNull]("equal_null"), expression[HllSketchEstimate]("hll_sketch_estimate"), expression[HllUnion]("hll_union"), +expression[UnwrapUDT]("unwrap_udt"), Review Comment: Moved this out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]
hvanhovell commented on code in PR #47464: URL: https://github.com/apache/spark/pull/47464#discussion_r1692345905 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -519,6 +519,8 @@ object FunctionRegistry { expressionBuilder("mode", ModeBuilder), expression[HllSketchAgg]("hll_sketch_agg"), expression[HllUnionAgg]("hll_union_agg"), +expression[Product]("product"), +expression[CollectTopK]("collect_top_k"), Review Comment: Moved this out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [ONLY TEST][HOLD] Upgrade rocksdbjni to 9.4.0 [spark]
panbingkun commented on PR #47207: URL: https://github.com/apache/spark/pull/47207#issuecomment-2251754892 > Has there been any new progress on this one Let's wait a little longer, I think version 9.5 should be released soon https://github.com/user-attachments/assets/c31f7463-8222-4842-9359-eb67ca4c9cec Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]
HyukjinKwon commented on PR #47133: URL: https://github.com/apache/spark/pull/47133#issuecomment-2251778289 The test failure doesn't look related to me. Can you reproduce it locally? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]
anishshri-db commented on code in PR #47490: URL: https://github.com/apache/spark/pull/47490#discussion_r1692363879 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -79,6 +79,38 @@ class IncrementalExecution( StreamingTransformWithStateStrategy :: Nil } + // Methods to enable the use of AsyncLogPurge + protected val minLogEntriesToMaintain: Int = +sparkSession.sessionState.conf.minBatchesToRetain + + val errorNotifier: ErrorNotifier = new ErrorNotifier() + + override protected def purge(threshold: Long): Unit = {} + + override protected def purgeOldest(planWithStateOpId: SparkPlan): Unit = { +planWithStateOpId.collect { + case ssw: StateStoreWriter => +ssw.operatorStateMetadataVersion match { + case 2 => +val fileManager = new OperatorStateMetadataV2FileManager( + new Path(checkpointLocation, ssw.getStateInfo.operatorId.toString), + ssw.stateSchemaFilePath(Some(StateStoreId.DEFAULT_STORE_NAME)), + hadoopConf) + fileManager.keepNEntries(minLogEntriesToMaintain) + case _ => +} + case _ => +} + } + + private def purgeMetadataFiles(planWithStateOpId: SparkPlan): Unit = { +if (useAsyncPurge) { Review Comment: We can just always do async right ? is there a reason to also do sync ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]
anishshri-db commented on code in PR #47490: URL: https://github.com/apache/spark/pull/47490#discussion_r1692364074 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadata.scala: ## @@ -313,3 +314,89 @@ class OperatorStateMetadataV2Reader( OperatorStateMetadataUtils.readMetadata(inputStream) } } + +class OperatorStateMetadataV2FileManager( Review Comment: Can we add a comment for this class ? what are we using this for exactly ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]
anishshri-db commented on code in PR #47490: URL: https://github.com/apache/spark/pull/47490#discussion_r1692364600 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -79,6 +79,38 @@ class IncrementalExecution( StreamingTransformWithStateStrategy :: Nil } + // Methods to enable the use of AsyncLogPurge + protected val minLogEntriesToMaintain: Int = +sparkSession.sessionState.conf.minBatchesToRetain + + val errorNotifier: ErrorNotifier = new ErrorNotifier() + + override protected def purge(threshold: Long): Unit = {} + + override protected def purgeOldest(planWithStateOpId: SparkPlan): Unit = { +planWithStateOpId.collect { + case ssw: StateStoreWriter => +ssw.operatorStateMetadataVersion match { + case 2 => +val fileManager = new OperatorStateMetadataV2FileManager( + new Path(checkpointLocation, ssw.getStateInfo.operatorId.toString), + ssw.stateSchemaFilePath(Some(StateStoreId.DEFAULT_STORE_NAME)), + hadoopConf) + fileManager.keepNEntries(minLogEntriesToMaintain) Review Comment: The logic here might not be very straightforward. Basically, we need to keep the metadata for the earliest retained state. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
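To make the retention discussion above concrete, here is a hedged Scala sketch of a keep-newest-N purge over batch-id-named metadata files. The directory layout, file naming, and class name are assumptions for illustration; they are not the actual `OperatorStateMetadataV2FileManager` internals. The point is that the deletion threshold is derived from the earliest batch that is still retained, so that batch keeps its metadata.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative purger: keeps the newest `entriesToKeep` files under `metadataDir`,
// assuming each file is named after its batch id (e.g. .../metadata/17).
class MetadataRetentionSketch(metadataDir: Path, hadoopConf: Configuration) {
  private val fs: FileSystem = metadataDir.getFileSystem(hadoopConf)

  def keepNEntries(entriesToKeep: Int): Unit = {
    require(entriesToKeep > 0, "must retain at least one entry")
    if (!fs.exists(metadataDir)) return
    val batches = fs.listStatus(metadataDir)
      .flatMap(s => scala.util.Try(s.getPath.getName.toLong).toOption.map(_ -> s.getPath))
      .sortBy(_._1)
    if (batches.length <= entriesToKeep) return
    // Everything strictly older than the earliest retained batch can go; that earliest
    // retained batch keeps its metadata so a restart can still resolve its state schema.
    val earliestRetained = batches(batches.length - entriesToKeep)._1
    batches.takeWhile(_._1 < earliestRetained).foreach { case (_, path) =>
      fs.delete(path, false)
    }
  }
}
```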
Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]
cloud-fan commented on PR #47388: URL: https://github.com/apache/spark/pull/47388#issuecomment-2251791506 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]
cloud-fan closed pull request #47388: [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows URL: https://github.com/apache/spark/pull/47388 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns [spark]
cloud-fan commented on PR #47451: URL: https://github.com/apache/spark/pull/47451#issuecomment-2251797510 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns [spark]
cloud-fan closed pull request #47451: [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns URL: https://github.com/apache/spark/pull/47451 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 12 [spark]
yaooqinn commented on PR #47488: URL: https://github.com/apache/spark/pull/47488#issuecomment-2251851725 Late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]
yaooqinn commented on PR #47489: URL: https://github.com/apache/spark/pull/47489#issuecomment-2251851929 Late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org