Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-25 Thread via GitHub


mridulm commented on code in PR #47192:
URL: https://github.com/apache/spark/pull/47192#discussion_r1690907811


##
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
##
@@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable {
   * joins. The value of this accumulator should be approximately the sum of the peak sizes
   * across all such data structures created in this task. For SQL jobs, this only tracks all
   * unsafe operators and ExternalSort.
+   * This is not equal to peakOnHeapExecutionMemory + peakOffHeapExecutionMemory
   */
+  // TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole
+  //  execution memory. We'd better deprecate this once we have a replacement.
   def peakExecutionMemory: Long = _peakExecutionMemory.sum
 
+  /**
+   * Peak on heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum
+
+  /**
+   * Peak off heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum

Review Comment:
   Discuss:
   Is it required that `peakExecutionMemory` <= `peakOnHeapExecutionMemory + peakOffHeapExecutionMemory`?
   Any cases where this might get violated?
   
   I am trying to reason about completeness of these metrics (given we want to eventually deprecate the existing one).
   I expect the above to hold, but want to make sure I am not missing anything.
   +CC @JoshRosen 
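   
   One subtlety worth spelling out (a toy sketch of the arithmetic only, not Spark's actual accounting): a sum of per-structure peaks taken at different times is not automatically bounded by the peak of a running total, since the individual peaks need not overlap in time. For example:
   
   ```
   // Toy model: two consumers that never hold memory at the same time.
   // The sum of their individual peaks (200) exceeds the peak of the
   // running total (100), so "sum of peaks <= peak of sum" is not automatic.
   var current = 0L
   var peakTotal = 0L
   def acquire(bytes: Long): Unit = { current += bytes; peakTotal = peakTotal.max(current) }
   def release(bytes: Long): Unit = { current -= bytes }
   
   acquire(100); release(100) // consumer A: individual peak 100
   acquire(100); release(100) // consumer B: individual peak 100
   
   val sumOfIndividualPeaks = 100L + 100L
   assert(peakTotal == 100L && sumOfIndividualPeaks > peakTotal)
   ```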






Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on code in PR #47480:
URL: https://github.com/apache/spark/pull/47480#discussion_r1690912988


##
sql/core/src/test/resources/sql-tests/inputs/sql-on-files.sql:
##
@@ -1,19 +1,30 @@
+CREATE DATABASE IF NOT EXISTS sql_on_files;
 -- Parquet
+CREATE TABLE sql_on_files.TEST_PARQUET USING PARQUET AS SELECT 1;

Review Comment:
   That makes two of us.






Re: [PR] [SPARK-48829][BUILD] Upgrade `RoaringBitmap` to 1.2.1 [spark]

2024-07-25 Thread via GitHub


LuciferYang commented on PR #47247:
URL: https://github.com/apache/spark/pull/47247#issuecomment-2249601828

   ready to go?





Re: [PR] [ONLY TEST][HOLD] Upgrade rocksdbjni to 9.4.0 [spark]

2024-07-25 Thread via GitHub


LuciferYang commented on PR #47207:
URL: https://github.com/apache/spark/pull/47207#issuecomment-2249610870

   Has there been any new progress on this one?





Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]

2024-07-25 Thread via GitHub


LuciferYang commented on PR #47455:
URL: https://github.com/apache/spark/pull/47455#issuecomment-2249631043

   Are there any conclusions from this test?





Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]

2024-07-25 Thread via GitHub


panbingkun commented on PR #47455:
URL: https://github.com/apache/spark/pull/47455#issuecomment-2249724363

   > Are there any conclusions from this test?
   
   From the testing, it seems that there is no significant difference in performance.
   Regarding the version of `janino`, we seem to be lagging behind a lot, as follows:
   https://github.com/user-attachments/assets/108448fd-d42b-47ed-9774-42e71bd1cffd
   





Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db closed pull request #47424: [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation
URL: https://github.com/apache/spark/pull/47424





Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]

2024-07-25 Thread via GitHub


LuciferYang commented on PR #47455:
URL: https://github.com/apache/spark/pull/47455#issuecomment-2249732282

   Since this component is very critical, if there is no noticeable performance enhancement or critical bug fix, I recommend staying on the current version.
   cc @cloud-fan @dongjoon-hyun 





[PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db opened a new pull request, #47484:
URL: https://github.com/apache/spark/pull/47484

   
   
   ### What changes were proposed in this pull request?
   
   
   Use `HashSet`/`HashMap` instead of doing linear searches over the `Seq`. In the case of thousands of partitions, this significantly improves performance.
   
   
   ### Why are the changes needed?
   
   
   To avoid the O(n*m) passes in `PreprocessTableCreation`.
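   
   A minimal sketch of the pattern (hypothetical simplified types; the real rule resolves `Attribute`s inside `PreprocessTableCreation`):
   
   ```
   case class Attr(name: String)
   
   def resolvePartitionAttrs(partitionColumnNames: Seq[String], output: Seq[Attr]): Seq[Attr] = {
     // Before: O(n*m) -- one linear scan of `output` per partition column:
     //   partitionColumnNames.map(partCol => output.find(_.name == partCol).get)
     // After: O(n + m) -- build a name -> attribute index once, then do
     // constant-time lookups. Name-based matching is fine here because
     // `partitionColumnNames` are plain strings.
     val byName = output.map(a => a.name -> a).toMap
     partitionColumnNames.map(byName)
   }
   ```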
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   Existing UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No
   





Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db commented on PR #47424:
URL: https://github.com/apache/spark/pull/47424#issuecomment-2249742061

   Recreated my fork again... Also deleted apache-spark-ci-image
   https://github.com/apache/spark/pull/47484





Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


cloud-fan commented on PR #47483:
URL: https://github.com/apache/spark/pull/47483#issuecomment-2249767736

   This fixes a regression caused by https://github.com/apache/spark/pull/46565/files#diff-fbc6da30b8372e4f9aeb35ccf0d39eb796715d192c7eaeab109376584de0790eR121 and makes Delta Lake pass all tests. I think we should include it in 3.5.2. cc @yaooqinn 





Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47483:
URL: https://github.com/apache/spark/pull/47483#issuecomment-2249780123

   Do we need this for branch 3.4?





Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47483:
URL: https://github.com/apache/spark/pull/47483#issuecomment-2249812967

   Merged to branch-3.5, thank you all





Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


yaooqinn closed pull request #47483: [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy
URL: https://github.com/apache/spark/pull/47483





Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]

2024-07-25 Thread via GitHub


panbingkun commented on PR #47455:
URL: https://github.com/apache/spark/pull/47455#issuecomment-2249830972

   Actually, it has some bug fixes, such as typical ones:
   [spark compilation failed with ArrayIndexOutOfBoundsException](https://github.com/janino-compiler/janino/issues/208)





Re: [PR] [TEST janino] v3.1.12 VS v3.1.9 [spark]

2024-07-25 Thread via GitHub


LuciferYang commented on PR #47455:
URL: https://github.com/apache/spark/pull/47455#issuecomment-2249837072

   > Actually, it has some bug fixes, such as typical ones: [Fixed issue `spark compilation failed with ArrayIndexOutOfBoundsException`](https://github.com/janino-compiler/janino/issues/208)
   
   Can we add the corresponding test cases in the current PR?





Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db commented on code in PR #47484:
URL: https://github.com/apache/spark/pull/47484#discussion_r1691116665


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##
@@ -248,10 +249,15 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical
         DDLUtils.checkTableColumns(tableDesc.copy(schema = analyzedQuery.schema))
 
         val output = analyzedQuery.output
-        val partitionAttrs = normalizedTable.partitionColumnNames.map { partCol =>
-          output.find(_.name == partCol).get

Review Comment:
   @HyukjinKwon do you know if the original code is correct? The attribute references here are compared by their names and not ids






Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


cloud-fan commented on PR #47483:
URL: https://github.com/apache/spark/pull/47483#issuecomment-2249894610

   It's fine to skip 3.4 as https://github.com/apache/spark/pull/46565 was not merged to 3.4 either.





Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]

2024-07-25 Thread via GitHub


yaooqinn closed pull request #47480: [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test
URL: https://github.com/apache/spark/pull/47480





Re: [PR] [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47483:
URL: https://github.com/apache/spark/pull/47483#issuecomment-2249906895

   Thank you @cloud-fan 





Re: [PR] [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47480:
URL: https://github.com/apache/spark/pull/47480#issuecomment-2249910142

   Merged to master, thank you @cloud-fan 





Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db commented on PR #47484:
URL: https://github.com/apache/spark/pull/47484#issuecomment-2250017548

   @HyukjinKwon hi. Finally managed to make the Actions in my fork work. Tests passed.





Re: [PR] [SPARK-48344][SQL] SQL API change to support execution of compound statements [spark]

2024-07-25 Thread via GitHub


miland-db commented on code in PR #47403:
URL: https://github.com/apache/spark/pull/47403#discussion_r1691277611


##
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingInterpreter.scala:
##
@@ -73,11 +74,19 @@ case class SqlScriptingInterpreter() {
           .map(new SingleStatementExec(_, Origin(), isInternal = true))
           .reverse
         new CompoundBodyExec(
-          body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables)
+          body.collection.map(st => transformTreeIntoExecutable(st)) ++ dropVariables, session)
       case sparkStatement: SingleStatement =>
         new SingleStatementExec(
           sparkStatement.parsedPlan,
           sparkStatement.origin,
-          isInternal = false)
+          shouldCollectResult = true)
     }
+
+  def execute(compoundBody: CompoundBody): Iterator[Array[Row]] = {

Review Comment:
   For now, because we don't know which statement is the last (so as to `collect()` only that one and use a noop sink for the rest), we decided to collect results for all statements.
   
   Either way, it is important to execute each statement as soon as we encounter it, to be able to handle errors properly. The [PR introducing handlers](https://github.com/apache/spark/pull/47423) is currently work in progress and will probably explain why we did things the way we did in this PR.
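   
   A minimal sketch (hypothetical, heavily simplified types) of the execution model described above: each statement runs as soon as the iterator reaches it, and its collected rows are emitted lazily.
   
   ```
   trait StatementExec { def executeAndCollect(): Array[Row] }
   case class Row(values: Any*)
   
   def execute(statements: Seq[StatementExec]): Iterator[Array[Row]] =
     statements.iterator.map { stmt =>
       // Eager per-statement execution keeps any error at the boundary of
       // the statement that caused it, which error handlers can rely on.
       stmt.executeAndCollect()
     }
   ```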






Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]

2024-07-25 Thread via GitHub


HyukjinKwon commented on PR #47476:
URL: https://github.com/apache/spark/pull/47476#issuecomment-2250101451

   Merged to master.





Re: [PR] [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation [spark]

2024-07-25 Thread via GitHub


vladimirg-db commented on code in PR #47484:
URL: https://github.com/apache/spark/pull/47484#discussion_r1691291589


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:
##
@@ -248,10 +249,15 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical
         DDLUtils.checkTableColumns(tableDesc.copy(schema = analyzedQuery.schema))
 
         val output = analyzedQuery.output
-        val partitionAttrs = normalizedTable.partitionColumnNames.map { partCol =>
-          output.find(_.name == partCol).get

Review Comment:
   Ok nvm, should be fine, since `partitionColumnNames` are just strings






Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]

2024-07-25 Thread via GitHub


HyukjinKwon closed pull request #47476: [MINOR][DOCS] Update doc `sql/README.md`
URL: https://github.com/apache/spark/pull/47476





Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]

2024-07-25 Thread via GitHub


HeartSaVioR commented on code in PR #47445:
URL: https://github.com/apache/spark/pull/47445#discussion_r1691375121


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##
@@ -208,13 +208,16 @@ class IncrementalExecution(
         }
         val schemaValidationResult = statefulOp.
           validateAndMaybeEvolveStateSchema(hadoopConf, currentBatchId, stateSchemaVersion)
+        val stateSchemaPaths = schemaValidationResult.map(_.schemaPath)
         // write out the state schema paths to the metadata file
         statefulOp match {
-          case stateStoreWriter: StateStoreWriter =>
-            val metadata = stateStoreWriter.operatorStateMetadata()
-            // TODO: [SPARK-48849] Populate metadata with stateSchemaPaths if metadata version is v2
-            val metadataWriter = new OperatorStateMetadataWriter(new Path(
-              checkpointLocation, stateStoreWriter.getStateInfo.operatorId.toString), hadoopConf)
+          case ssw: StateStoreWriter =>
+            val metadata = ssw.operatorStateMetadata(stateSchemaPaths)

Review Comment:
   I'm a bit concerned about the number of files we are going to write over the query lifecycle, but we can defer that discussion and decision to the time we handle purging.






Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]

2024-07-25 Thread via GitHub


HeartSaVioR commented on PR #47445:
URL: https://github.com/apache/spark/pull/47445#issuecomment-2250228784

   Thanks! Merging to master.





Re: [PR] [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator [spark]

2024-07-25 Thread via GitHub


HeartSaVioR closed pull request #47445: [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator
URL: https://github.com/apache/spark/pull/47445





[PR] [WIP][SPARK-49002][SQL] Consistently handle invalid location/path values for all database objects [spark]

2024-07-25 Thread via GitHub


yaooqinn opened a new pull request, #47485:
URL: https://github.com/apache/spark/pull/47485

   
   
   
   ### What changes were proposed in this pull request?
   
   
   We are now consistently handling invalid location/path values for all database objects in this pull request. Before this PR, we only checked for `null` and `""` for a small group of operations, such as `SetNamespaceLocation` and `CreateNamespace`. However, various other commands or queries involving a location did not undergo verification. Besides, we also didn't apply suitable error classes for other syntax errors beyond `null` and `""`.
   
   In this PR, we add a try-catch block to rethrow INVALID_LOCATION errors for `null`, `""`, and all other invalid inputs, and all operations for databases, tables, partitions, and raw paths are validated.
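   
   A sketch of the validation shape described above (the helper and the error representation are illustrative assumptions, not Spark's actual API):
   
   ```
   import java.net.{URI, URISyntaxException}
   
   def validateLocation(location: String): URI = {
     if (location == null || location.isEmpty) {
       // Consistent error class for empty input.
       throw new IllegalArgumentException("INVALID_LOCATION.EMPTY")
     }
     try new URI(location) catch {
       // Rethrow low-level parsing failures under a consistent error class
       // instead of leaking the raw parsing exception.
       case e: URISyntaxException =>
         throw new IllegalArgumentException("INVALID_LOCATION.SYNTAX", e)
     }
   }
   ```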
   
   
   ### Why are the changes needed?
   
   For better and consistent path errors
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, INVALID_EMPTY_LOCATION -> INVALID_LOCATION.EMPTY, and the IllegalArgumentException thrown by path parsing is replaced with an INVALID_LOCATION.SYNTAX error
   
   ### How was this patch tested?
   
   new tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-32086][YARN]Bug fix for RemoveBroadcast RPC failed after executor is shutdown [spark]

2024-07-25 Thread via GitHub


esnhysythh commented on PR #28921:
URL: https://github.com/apache/spark/pull/28921#issuecomment-2250349381

   This pull request is dangerous. It may cause deadlock problems (I have experimented with it).





Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1691630792


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -519,6 +519,8 @@ object FunctionRegistry {
 expressionBuilder("mode", ModeBuilder),
 expression[HllSketchAgg]("hll_sketch_agg"),
 expression[HllUnionAgg]("hll_union_agg"),
+expression[Product]("product"),

Review Comment:
   Ok, let me try something different.






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1691639698


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -519,6 +519,8 @@ object FunctionRegistry {
 expressionBuilder("mode", ModeBuilder),
 expression[HllSketchAgg]("hll_sketch_agg"),
 expression[HllUnionAgg]("hll_union_agg"),
+expression[Product]("product"),

Review Comment:
   I will create a separate registry for these functions, and only use them in DataFrame resolution.






[PR] [SPARK-49003][SQL][COLLATION] Fix calculating hash value of collated strings [spark]

2024-07-25 Thread via GitHub


ilicmarkodb opened a new pull request, #47486:
URL: https://github.com/apache/spark/pull/47486

   ### What changes were proposed in this pull request?
   Fixed the calculation of hash values for collated strings by changing the hashing function to use a proper hash for collated strings.
   
   ### Why are the changes needed?
   Because the hash function was calculating different hash values for equal collated strings.
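   
   A small illustration of the problem, using the JDK collator as a stand-in (Spark's actual collation support differs): strings that compare equal under a collation can still have different raw hashes, so the hash must be computed over a collation key rather than over the raw string.
   
   ```
   import java.text.Collator
   import java.util.Locale
   
   val collator = Collator.getInstance(Locale.ROOT)
   collator.setStrength(Collator.SECONDARY) // case differences are ignored
   
   val a = "Spark"
   val b = "SPARK"
   
   assert(collator.compare(a, b) == 0) // equal under the collation...
   assert(a.hashCode != b.hashCode)    // ...but raw hashes differ
   
   // Hashing the collation key makes hashing consistent with equality.
   val keyA = collator.getCollationKey(a).toByteArray.toSeq
   val keyB = collator.getCollationKey(b).toByteArray.toSeq
   assert(keyA.hashCode == keyB.hashCode)
   ```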
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   New tests inside `HashExpressionsSuite.scala`
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





Re: [PR] [SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API [spark]

2024-07-25 Thread via GitHub


chirag-s-db commented on PR #47376:
URL: https://github.com/apache/spark/pull/47376#issuecomment-2250755596

   @HeartSaVioR https://github.com/apache/spark/pull/47301 has been merged, ready for review again!





[PR] Mima [spark]

2024-07-25 Thread via GitHub


xupefei opened a new pull request, #47487:
URL: https://github.com/apache/spark/pull/47487

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [MINOR][DOCS] Update doc `sql/README.md` [spark]

2024-07-25 Thread via GitHub


amaliujia commented on code in PR #47476:
URL: https://github.com/apache/spark/pull/47476#discussion_r1691718137


##
sql/README.md:
##
@@ -3,7 +3,8 @@ Spark SQL
 
 This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
 
-Spark SQL is broken up into four subprojects:
+Spark SQL is broken up into five subprojects:
+ - API (sql/api) - Includes some public API like DataType, Row, etc. This component can be shared between Catalyst and Spark Connect client.

Review Comment:
   Thanks for the documentation!






Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]

2024-07-25 Thread via GitHub


agubichev commented on PR #47388:
URL: https://github.com/apache/spark/pull/47388#issuecomment-2250792171

   > To extend it a bit more, shall we allow `where func(T_inner.x) = T_outer.date group by T_inner.x` if the `func` guarantees to produce different results for different values of `T_inner.x`?
   
   We will look into this as a follow-up step, but ultimately this should be a runtime-level check (verifying that the join returns at most 1 row), so that we can allow all kinds of subqueries and let the runtime throw an error for bad cases.
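   
   A sketch of that runtime guard (illustrative only; the name and error text are not Spark's): a scalar subquery stays well-defined as long as each outer row matches at most one inner row.
   
   ```
   def scalarValue[V](matchesForOuterRow: Seq[V]): Option[V] = matchesForOuterRow match {
     case Seq()  => None    // no matching inner row: the scalar result is NULL
     case Seq(v) => Some(v) // exactly one row: a well-defined scalar value
     case _      => throw new IllegalStateException(
       "scalar subquery returned more than one row for an outer row")
   }
   ```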





[PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Pyth… [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun opened a new pull request, #47488:
URL: https://github.com/apache/spark/pull/47488

   …on 3.12
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun opened a new pull request, #47489:
URL: https://github.com/apache/spark/pull/47489

   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]

2024-07-25 Thread via GitHub


ericm-db opened a new pull request, #47490:
URL: https://github.com/apache/spark/pull/47490

   
   
   ### What changes were proposed in this pull request?
   
   Currently, OperatorStateMetadataV2 and StateSchemaV3 files are written for every new query run. This PR will implement purging of these files so we only keep `minLogEntriesToMaintain` files per query.
   
   ### Why are the changes needed?
   
   These changes are needed so that we don't indefinitely keep these files across many query runs, bounding the number of state files we keep.
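   
   A minimal sketch of the purge policy (a hypothetical helper, simplified to `java.io.File`; it assumes metadata file names are numeric batch ids, so sorting them numerically orders them oldest-first):
   
   ```
   import java.io.File
   
   def purgeOldMetadataFiles(dir: File, minLogEntriesToMaintain: Int): Unit = {
     val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
     // Keep only the newest `minLogEntriesToMaintain` entries; delete the rest.
     val oldestFirst = files.sortBy(_.getName.toLong)
     oldestFirst.dropRight(minLogEntriesToMaintain).foreach(_.delete())
   }
   ```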
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Added unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2250942959

   cc @yaooqinn and @viirya 





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2250943259

   cc @yaooqinn and @viirya 





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2250938296

   cc @yaooqinn and @viirya 





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2250955060

   Also, cc @huaxingao 





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2250955410

   Also, cc @huaxingao 





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2250968487

   Thank you so much, @huaxingao !





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2250972235

   Thank you, @huaxingao !





Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]

2024-07-25 Thread via GitHub


gene-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1691812112


##
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##
@@ -120,6 +120,12 @@ public class VariantUtil {
   // Long string value. The content is (4-byte little-endian unsigned integer representing the
   // string size) + (size bytes of string content).
   public static final int LONG_STR = 16;
+  // year-month interval value. The content is one byte representing the start and end field values
+  // (1 bit each starting at least significant bits) and a 4-byte little-endian signed integer
+  public static final int YEAR_MONTH_INTERVAL = 19;
+  // day-time interval value. The content is one byte representing the start and end field values
+  // (2 bits each starting at least significant bits) and an 8-byte little-endian signed integer

Review Comment:
   Similar to the README, we should document what 0, 1, 2, 3 represent for the start and end fields.



##
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##
@@ -88,6 +91,16 @@ public long getLong() {
     return VariantUtil.getLong(value, pos);
   }
 
+  // Get the start and end fields of a year-month interval from the variant.
+  public IntervalFields getYearMonthIntervalFields() {

Review Comment:
   What uses these APIs?



##
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##
@@ -377,11 +405,52 @@ public static long getLong(byte[] value, int pos) {
       case TIMESTAMP:
       case TIMESTAMP_NTZ:
         return readLong(value, pos + 1, 8);
+      case YEAR_MONTH_INTERVAL:
+        return readLong(value, pos + 2, 4);
+      case DAY_TIME_INTERVAL:
+        return readLong(value, pos + 2, 8);
       default:
         throw new IllegalStateException(exceptionMessage);
     }
   }
 
+  // Class used to pass around start and end fields of year-month and day-time interval values.
+  public static class IntervalFields {
+    public IntervalFields(byte startField, byte endField) {
+      this.startField = startField;
+      this.endField = endField;
+    }
+
+    public final byte startField;
+    public final byte endField;
+  }
+
+  // Get the start and end fields of a variant value representing a year-month interval value. The
+  // returned array contains the start field at the zeroth index and the end field at the first
+  // index.
+  public static IntervalFields getYearMonthIntervalFields(byte[] value, int pos) {
+    long fieldInfo = readLong(value, pos + 1, 1);
+    IntervalFields intervalFields = new IntervalFields((byte) (fieldInfo & 0x1),
+        (byte) ((fieldInfo >> 1) & 0x1));
+    if (intervalFields.endField < intervalFields.startField) {

Review Comment:
   Do we need to check the type to be year-month interval, and the length? Some other functions call `checkIndex`.
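   
   For reference, the bit layout quoted in the diffs above can be decoded as follows (a sketch mirroring the diff, not the actual implementation):
   
   ```
   // Year-month intervals pack start/end in 1 bit each; day-time in 2 bits each,
   // always starting at the least significant bits of the byte after the header.
   def yearMonthIntervalFields(fieldInfo: Int): (Byte, Byte) = {
     val startField = (fieldInfo & 0x1).toByte      // bit 0
     val endField = ((fieldInfo >> 1) & 0x1).toByte // bit 1
     require(startField <= endField, "malformed interval fields")
     (startField, endField)
   }
   
   def dayTimeIntervalFields(fieldInfo: Int): (Byte, Byte) = {
     val startField = (fieldInfo & 0x3).toByte      // bits 0-1
     val endField = ((fieldInfo >> 2) & 0x3).toByte // bits 2-3
     require(startField <= endField, "malformed interval fields")
     (startField, endField)
   }
   ```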



##
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##
@@ -355,9 +380,12 @@ public static boolean getBoolean(byte[] value, int pos) {
 
   // Get a long value from variant value `value[pos...]`.
   // It is only legal to call it if `getType` returns one of `Type.LONG/DATE/TIMESTAMP/
-  // TIMESTAMP_NTZ`. If the type is `DATE`, the return value is guaranteed to fit into an int and
-  // represents the number of days from the Unix epoch. If the type is `TIMESTAMP/TIMESTAMP_NTZ`,
-  // the return value represents the number of microseconds from the Unix epoch.
+  // TIMESTAMP_NTZ/YEAR_MONTH_INTERVAL/DAY_TIME_INTERVAL`. If the type is `DATE`, the return value
+  // is guaranteed to fit into an int and represents the number of days from the Unix epoch.
+  // If the type is `TIMESTAMP/TIMESTAMP_NTZ`, the return value represents the number of
+  // microseconds from the Unix epoch. If the type is `YEAR_MONTH_INTERVAL`, the return value

Review Comment:
   We should also mention that the year-month one is guaranteed to fit into an int.



##
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##
@@ -113,6 +126,11 @@ public String getString() {
     return VariantUtil.getString(value, pos);
   }
 
+  // Get the type info bits from a variant value.
+  public int getTypeInfo() {

Review Comment:
   I can't easily tell, but what uses this API?




Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]

2024-07-25 Thread via GitHub


harshmotw-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1691852875


##
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##
@@ -113,6 +126,11 @@ public String getString() {
     return VariantUtil.getString(value, pos);
   }
 
+  // Get the type info bits from a variant value.
+  public int getTypeInfo() {

Review Comment:
   It is used when throwing an `UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT` exception from scopes which don't directly have access to the type info.






Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2251060362

   Let me merge this to recover the CIs.





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun closed pull request #47488: [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12
URL: https://github.com/apache/spark/pull/47488





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2251061801

   Let me merge this to recover the CIs.





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun closed pull request #47489: [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12
URL: https://github.com/apache/spark/pull/47489





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2251069589

   Thank you, @viirya !





Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]

2024-07-25 Thread via GitHub


harshmotw-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1691939585


##
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##
@@ -377,11 +405,52 @@ public static long getLong(byte[] value, int pos) {
       case TIMESTAMP:
       case TIMESTAMP_NTZ:
         return readLong(value, pos + 1, 8);
+      case YEAR_MONTH_INTERVAL:
+        return readLong(value, pos + 2, 4);
+      case DAY_TIME_INTERVAL:
+        return readLong(value, pos + 2, 8);
       default:
         throw new IllegalStateException(exceptionMessage);
     }
   }
 
+  // Class used to pass around start and end fields of year-month and day-time interval values.
+  public static class IntervalFields {
+    public IntervalFields(byte startField, byte endField) {
+      this.startField = startField;
+      this.endField = endField;
+    }
+
+    public final byte startField;
+    public final byte endField;
+  }
+
+  // Get the start and end fields of a variant value representing a year-month interval value. The
+  // returned array contains the start field at the zeroth index and the end field at the first
+  // index.
+  public static IntervalFields getYearMonthIntervalFields(byte[] value, int pos) {
+    long fieldInfo = readLong(value, pos + 1, 1);
+    IntervalFields intervalFields = new IntervalFields((byte) (fieldInfo & 0x1),
+        (byte) ((fieldInfo >> 1) & 0x1));
+    if (intervalFields.endField < intervalFields.startField) {

Review Comment:
   Yes, that would have been a good idea. `checkIndex` is performed by 
`readLong` anyway though. Done.






Re: [PR] [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [spark]

2024-07-25 Thread via GitHub


harshmotw-db commented on code in PR #47473:
URL: https://github.com/apache/spark/pull/47473#discussion_r1691942388


##
common/variant/src/main/java/org/apache/spark/types/variant/Variant.java:
##
@@ -88,6 +91,16 @@ public long getLong() {
 return VariantUtil.getLong(value, pos);
   }
 
+  // Get the start and end fields of a year-month interval from the variant.
+  public IntervalFields getYearMonthIntervalFields() {

Review Comment:
   `schemaOf`, `cast from variant` and `to_json`






Re: [PR] [wip]Metadata vcf [spark]

2024-07-25 Thread via GitHub


ericm-db closed pull request #47446: [wip]Metadata vcf
URL: https://github.com/apache/spark/pull/47446





Re: [PR] [SPARK-48821][SQL] Support Update in DataFrameWriterV2 [spark]

2024-07-25 Thread via GitHub


szehon-ho commented on PR #47233:
URL: https://github.com/apache/spark/pull/47233#issuecomment-2251363935

   Changed the API to:
   
   ```
   spark.table(tableNameAsString)
 .update(Map("salary" -> lit(-1)))
 .where($"pk" >= 2)
 .execute()
   ```
   
   We can add `option` support in a follow-up PR.





Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]

2024-07-25 Thread via GitHub


ueshin commented on PR #47474:
URL: https://github.com/apache/spark/pull/47474#issuecomment-2251366183

   The failure seems unrelated to this PR.





Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]

2024-07-25 Thread via GitHub


ueshin commented on PR #47474:
URL: https://github.com/apache/spark/pull/47474#issuecomment-2251366450

   Thanks! merging to master.





Re: [PR] [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column [spark]

2024-07-25 Thread via GitHub


ueshin closed pull request #47474: [SPARK-48996][SQL][PYTHON] Allow bare 
literals for __and__ and __or__ of Column
URL: https://github.com/apache/spark/pull/47474





[PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun opened a new pull request, #47491:
URL: https://github.com/apache/spark/pull/47491

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-25 Thread via GitHub


liuzqt commented on code in PR #47192:
URL: https://github.com/apache/spark/pull/47192#discussion_r1692144786


##
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
##
@@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable {
* joins. The value of this accumulator should be approximately the sum of 
the peak sizes
* across all such data structures created in this task. For SQL jobs, this 
only tracks all
* unsafe operators and ExternalSort.
+   * This is not equal to peakOnHeapExecutionMemory + 
peakOffHeapExecutionMemory
*/
+  // TODO: SPARK-48789: the naming is confusing since this does not really 
reflect the whole
+  //  execution memory. We'd better deprecate this once we have a replacement.
   def peakExecutionMemory: Long = _peakExecutionMemory.sum
 
+  /**
+   * Peak on heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum
+
+  /**
+   * Peak off heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum

Review Comment:
   peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory?
   
   I think yes, because `TaskMemoryManager.acquireExecutionMemory` is the only 
   narrow waist for any execution memory acquisition, and the new metrics are 
   maintained there.
   
   In contrast, the legacy `peakExecutionMemory` is maintained inside individual 
   operators (join, agg, sort), so its coverage is entirely up to each operator's 
   implementation.
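   
   As an illustration, a minimal self-contained Scala sketch (simplified, 
   hypothetical names, not the actual `TaskMemoryManager` code) of why a single 
   acquisition choke point makes the two new peaks an upper bound on any 
   operator-reported peak:
   
   ```scala
   object MemoryMode extends Enumeration {
     val OnHeap, OffHeap = Value
   }
   
   // All execution memory flows through acquire/release, so the per-mode peaks
   // bound any metric that counts only a subset of these acquisitions.
   class PeakTrackingMemoryManager {
     private var usedOnHeap, usedOffHeap = 0L
     private var peakOnHeap, peakOffHeap = 0L
   
     def acquireExecutionMemory(required: Long, mode: MemoryMode.Value): Long = {
       mode match {
         case MemoryMode.OnHeap =>
           usedOnHeap += required
           peakOnHeap = math.max(peakOnHeap, usedOnHeap)
         case MemoryMode.OffHeap =>
           usedOffHeap += required
           peakOffHeap = math.max(peakOffHeap, usedOffHeap)
       }
       required
     }
   
     def releaseExecutionMemory(size: Long, mode: MemoryMode.Value): Unit = mode match {
       case MemoryMode.OnHeap  => usedOnHeap -= size
       case MemoryMode.OffHeap => usedOffHeap -= size
     }
   
     // Any operator-level metric counts a subset, so it is <= the sum of these.
     def peaks: (Long, Long) = (peakOnHeap, peakOffHeap)
   }
   ```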
   






Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2251424159

   Thank you, @viirya !





Re: [PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47491:
URL: https://github.com/apache/spark/pull/47491#issuecomment-2251426017

   Could you review this PR, @huaxingao ?





Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-25 Thread via GitHub


liuzqt commented on PR #47192:
URL: https://github.com/apache/spark/pull/47192#issuecomment-2251439038

   > Take a look at `peakExecutionMemory` within spark-core. We should be 
exposing the new metrics as part of the api - both at task level, and at stage 
level (distributions for ex).
   
   We should definitely expose this via the API. But can we land this core 
   change first and add the API/UI parts in follow-up PRs?





[PR] Rebased value state [spark]

2024-07-25 Thread via GitHub


jingz-db opened a new pull request, #47492:
URL: https://github.com/apache/spark/pull/47492

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-49008][PYTHON] Use `ParamSpec` to propagate `func` signature in `transform` [spark]

2024-07-25 Thread via GitHub


nicklamiller opened a new pull request, #47493:
URL: https://github.com/apache/spark/pull/47493

   ### What changes were proposed in this pull request?
   Propagate the function signature of `func` in `DataFrame(...).transform(...)`.
   
   
   ### Why are the changes needed?
   Propagating the function signature of `func` through 
   `DataFrame(...).transform(...)` enables type checkers like `mypy` to verify 
   that `func` is called with the correct arguments.
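   
   For reference, a minimal self-contained sketch of the `ParamSpec` pattern 
   this PR applies (the `DataFrame` below is a stub, so this illustrates the 
   typing approach rather than the actual PySpark signature):
   
   ```python
   from __future__ import annotations
   
   from typing import Callable, Concatenate, ParamSpec  # Python 3.10+
   
   P = ParamSpec("P")
   
   class DataFrame:
       def transform(
           self,
           func: Callable[Concatenate[DataFrame, P], DataFrame],
           *args: P.args,
           **kwargs: P.kwargs,
       ) -> DataFrame:
           # Forwarding the captured arguments lets type checkers match them
           # against func's own parameter list.
           return func(self, *args, **kwargs)
   
   def add_num(df: DataFrame, in_colname: str, *, num: int) -> DataFrame:
       return df
   
   df = DataFrame()
   df.transform(add_num, "col2", num=2)   # OK
   # df.transform(add_num, "col2", 2)     # mypy flags this: num is keyword-only
   ```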
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Ran an example script that passes `add_num` to `transform` with inappropriate 
   arguments:
   - first running mypy on `master`, showing that no mypy errors are raised
   - then running mypy with the changes in this PR on the 
     `type-hint-transform-args-kwargs` branch
     - this shows that the expected mypy errors are raised
   
   Screenshot (no mypy errors on master):
   https://github.com/user-attachments/assets/a2bb01b2-8ca8-41d6-a50a-fe95d7a1cfd7
   
   Screenshot (expected mypy errors with PR changes):
   https://github.com/user-attachments/assets/8d7920cf-e9ad-42a0-b435-1f55b0469f29
   

   
   
   Example script:
   
   ```python
   from pyspark.sql import DataFrame, functions as F, SparkSession
   
   spark = (
       SparkSession
       .builder
       .appName("Python Spark SQL basic example")
       .getOrCreate()
   )
   
   
   df = spark.createDataFrame([("a", 0), ("b", 1)], schema=["col1", "col2"])
   
   
   def add_num(df: DataFrame, in_colname: str, *, num: int) -> DataFrame:
       return df.withColumn("new_col", F.col(in_colname) + num)
   
   
   if __name__ == "__main__":
       df.transform(add_num, "col2", 2).show()                   # enforces kw
       df.transform(add_num, in_colname="col2", num="a").show()  # enforces type for kwarg
       df.transform(add_num, in_colname=2, num=2).show()         # enforces type for arg
       df.transform(add_num, "col2").show()                      # enforces required args
   ```
   
   

   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





Re: [PR] [SPARK-48628][CORE] Add task peak on/off heap memory metrics [spark]

2024-07-25 Thread via GitHub


JoshRosen commented on code in PR #47192:
URL: https://github.com/apache/spark/pull/47192#discussion_r1692177886


##
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
##
@@ -110,9 +112,22 @@ class TaskMetrics private[spark] () extends Serializable {
* joins. The value of this accumulator should be approximately the sum of 
the peak sizes
* across all such data structures created in this task. For SQL jobs, this 
only tracks all
* unsafe operators and ExternalSort.
+   * This is not equal to peakOnHeapExecutionMemory + 
peakOffHeapExecutionMemory
*/
+  // TODO: SPARK-48789: the naming is confusing since this does not really 
reflect the whole
+  //  execution memory. We'd better deprecate this once we have a replacement.
   def peakExecutionMemory: Long = _peakExecutionMemory.sum
 
+  /**
+   * Peak on heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOnHeapExecutionMemory: Long = _peakOnHeapExecutionMemory.sum
+
+  /**
+   * Peak off heap execution memory as tracked by TaskMemoryManager.
+   */
+  def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum

Review Comment:
   +1, I agree that the `peakExecutionMemory <= peakOnHeapExecutionMemory + 
peakOffHeapExecutionMemory` should hold:
   
   If we trace through the existing callers of `incPeakExecutionMemory` it 
looks like all of the usages flow from counts that correspond to the 
acquireExecutionMemory waist.






Re: [PR] [SPARK-49007][CORE] Improve `MasterPage` to support custom title [spark]

2024-07-25 Thread via GitHub


dongjoon-hyun commented on PR #47491:
URL: https://github.com/apache/spark/pull/47491#issuecomment-2251459571

   Thank you, @huaxingao !





Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-25 Thread via GitHub


bogao007 commented on PR #47133:
URL: https://github.com/apache/spark/pull/47133#issuecomment-2251495351

   Hi @HyukjinKwon, I'm getting the test errors below with my PR:
   ```
   ==
   ERROR [0.255s]: test_termination_sigterm (pyspark.tests.test_daemon.DaemonTests.test_termination_sigterm)
   Ensure that daemon and workers terminate on SIGTERM.
   --
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/tests/test_daemon.py", line 77, in test_termination_sigterm
       self.do_termination_test(lambda daemon: os.kill(daemon.pid, SIGTERM))
     File "/__w/spark/spark/python/pyspark/tests/test_daemon.py", line 49, in do_termination_test
       port = read_int(daemon.stdout)
              ^^^
     File "/__w/spark/spark/python/pyspark/serializers.py", line 597, in read_int
       raise EOFError
   EOFError
   ```
   But if I revert the changes in `python/pyspark/sql/pandas/group_ops.py` and 
   `python/pyspark/sql/streaming/__init__.py`, the tests succeed. Do you know 
   whether this could be related to my newly added file `stateful_processor.py`? 
   I'm not sure why it would be.





[PR] [SPARK-49010] Add unit tests for XML schema inference case sensitivity [spark]

2024-07-25 Thread via GitHub


shujingyang-db opened a new pull request, #47494:
URL: https://github.com/apache/spark/pull/47494

   
   
   ### What changes were proposed in this pull request?
   
   Currently, the XML data source respects the case sensitivity SQLConf (which 
   defaults to false) during schema inference, but we lack unit tests verifying 
   that behavior. This PR adds more unit tests for it.
   
   ### Why are the changes needed?
   
   This is a test-only change.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   This is a test-only change.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





[PR] [SPARK-49009][SQL][PYTHON] Make Column APIs and functions accept Enums [spark]

2024-07-25 Thread via GitHub


ueshin opened a new pull request, #47495:
URL: https://github.com/apache/spark/pull/47495

   ### What changes were proposed in this pull request?
   
   Make Column APIs and functions accept `Enum`s.
   
   ### Why are the changes needed?
   
   `Enum`s can be accepted in Column APIs and functions via their `value`.
   
   ```py
   >>> from pyspark.sql import functions as F
   >>> from enum import Enum
   >>> class A(Enum):
   ...   X = "x"
   ...   Y = "y"
   ...
   >>> F.lit(A.X)
   Column<'x'>
   >>> F.lit(A.X) + A.Y
   Column<'`+`(x, y)'>
   ```
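   
   A short follow-up usage sketch (the DataFrame and its `tag` column are made 
   up here, and the comparison relies on the behavior this PR adds):
   
   ```py
   >>> df = spark.createDataFrame([("x", 1), ("y", 2)], ["tag", "n"])
   >>> df.where(F.col("tag") == A.X).count()  # A.X is treated as lit("x")
   1
   ```
   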
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, Python's `Enum`s will be used as literal values.
   
   ### How was this patch tested?
   
   Added the related tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-47829][SQL] Text Datasource supports Zstd compression codec [spark]

2024-07-25 Thread via GitHub


github-actions[bot] closed pull request #46026: [SPARK-47829][SQL] Text 
Datasource supports Zstd compression codec
URL: https://github.com/apache/spark/pull/46026





Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

2024-07-25 Thread via GitHub


github-actions[bot] commented on PR #45446:
URL: https://github.com/apache/spark/pull/45446#issuecomment-2251630213

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-47279][CORE]When the messageLoop encounter a fatal exception, such as oom, exit the JVM to avoid the driver hanging forever [spark]

2024-07-25 Thread via GitHub


github-actions[bot] commented on PR #45385:
URL: https://github.com/apache/spark/pull/45385#issuecomment-2251630277

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-47835][NETWORK] Remove switch for remoteReadNioBufferConversion [spark]

2024-07-25 Thread via GitHub


github-actions[bot] closed pull request #46030: [SPARK-47835][NETWORK] Remove 
switch for remoteReadNioBufferConversion
URL: https://github.com/apache/spark/pull/46030






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692314400


##
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -5705,11 +5693,8 @@ object functions {
* @group datetime_funcs
* @since 3.2.0
*/
-  def session_window(timeColumn: Column, gapDuration: String): Column = {
-withExpr {
-  SessionWindow(timeColumn.expr, gapDuration)
-}.as("session_window")
-  }
+  def session_window(timeColumn: Column, gapDuration: String): Column =
+session_window(timeColumn, lit(gapDuration))

Review Comment:
   This is now supported because of the small change I made to 
ResolveTimeWindows.






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692315764


##
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -5743,7 +5728,7 @@ object functions {
* @since 3.2.0
*/
   def session_window(timeColumn: Column, gapDuration: Column): Column =
-Column.fn("session_window", timeColumn, gapDuration).as("session_window")

Review Comment:
   Alias is not needed. This is added by the ResolveTimeWindows rule.






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692320832


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTimeWindows.scala:
##
@@ -257,10 +259,11 @@ object SessionWindowing extends Rule[LogicalPlan] {
   case s: SessionWindow => sessionAttr
 }
 
-val filterByTimeRange = session.gapDuration match {
-  case Literal(interval: CalendarInterval, CalendarIntervalType) =>
-interval == null || interval.months + interval.days + 
interval.microseconds <= 0
-  case _ => true
+val filterByTimeRange = if (gapDuration.foldable) {

Review Comment:
   This is a bit better than pattern matching on a literal, since it also 
   supports more complex expressions, in particular `Cast(Literal("r", 
   StringType), CalendarIntervalType)`. Normally you'd expect constant folding 
   to take care of this, but that happens during optimization, and this rule is 
   part of analysis; hence the need for manual constant folding.
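   
   For illustration, a minimal sketch of that manual folding (hypothetical 
   helper name, not the exact rule code):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.unsafe.types.CalendarInterval
   
   // A foldable gap can be evaluated at analysis time, so a known-positive
   // interval needs no time-range filter; anything else keeps the filter.
   def needsTimeRangeFilter(gapDuration: Expression): Boolean = {
     if (gapDuration.foldable) {
       gapDuration.eval() match {
         case interval: CalendarInterval =>
           interval.months + interval.days + interval.microseconds <= 0
         case _ => true // null or any non-interval result keeps the filter
       }
     } else {
       true // non-foldable gaps must be validated per row
     }
   }
   ```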






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692311896


##
connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1893,33 +1855,6 @@ class SparkConnectPlanner(
 val unit = extractString(children(0), "unit")
 Some(TimestampAdd(unit, children(1), children(2)))
 
-  case "window" if Seq(2, 3, 4).contains(fun.getArgumentsCount) =>
-val children = fun.getArgumentsList.asScala.map(transformExpression)
-val timeCol = children.head
-val windowDuration = extractString(children(1), "windowDuration")
-var slideDuration = windowDuration
-if (fun.getArgumentsCount >= 3) {
-  slideDuration = extractString(children(2), "slideDuration")
-}
-var startTime = "0 second"
-if (fun.getArgumentsCount == 4) {
-  startTime = extractString(children(3), "startTime")
-}
-Some(
-  Alias(TimeWindow(timeCol, windowDuration, slideDuration, startTime), 
"window")(
-nonInheritableMetadataKeys = Seq(Dataset.DATASET_ID_KEY, 
Dataset.COL_POS_KEY)))
-
-  case "session_window" if fun.getArgumentsCount == 2 =>
-val children = fun.getArgumentsList.asScala.map(transformExpression)
-val timeCol = children.head
-val sessionWindow = children.last match {
-  case Literal(s, StringType) if s != null => SessionWindow(timeCol, 
s.toString)
-  case other => SessionWindow(timeCol, other)
-}
-Some(
-  Alias(sessionWindow, "session_window")(nonInheritableMetadataKeys =

Review Comment:
   @zhengruifeng Why do you explicitly remove the `Dataset.DATASET_ID_KEY` and 
`Dataset.COL_POS_KEY`?






Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-25 Thread via GitHub


HyukjinKwon commented on code in PR #47133:
URL: https://github.com/apache/spark/pull/47133#discussion_r1692328985


##
python/pyspark/sql/pandas/group_ops.py:
##
@@ -358,6 +364,140 @@ def applyInPandasWithState(
 )
 return DataFrame(jdf, self.session)
 
+def transformWithStateInPandas(
+self,
+statefulProcessor: StatefulProcessor,
+outputStructType: Union[StructType, str],
+outputMode: str,
+timeMode: str,
+) -> DataFrame:
+"""
+Invokes methods defined in the stateful processor used in arbitrary 
state API v2.
+We allow the user to act on per-group set of input rows along with 
keyed state and the
+user can choose to output/return 0 or more rows.
+
+For a streaming dataframe, we will repeatedly invoke the interface 
methods for new rows
+in each trigger and the user's state/state variables will be stored 
persistently across
+invocations.
+
+The `statefulProcessor` should be a Python class that implements the 
interface defined in
+pyspark.sql.streaming.stateful_processor.StatefulProcessor.
+
+The `outputStructType` should be a :class:`StructType` describing the 
schema of all
+elements in the returned value, `pandas.DataFrame`. The column labels 
of all elements in
+returned `pandas.DataFrame` must either match the field names in the 
defined schema if
+specified as strings, or match the field data types by position if not 
strings,
+e.g. integer indices.
+
+The size of each `pandas.DataFrame` in both the input and output can 
be arbitrary. The
+number of `pandas.DataFrame` in both the input and output can also be 
arbitrary.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+statefulProcessor : 
:class:`pyspark.sql.streaming.stateful_processor.StatefulProcessor`
+Instance of StatefulProcessor whose functions will be invoked by 
the operator.
+outputStructType : :class:`pyspark.sql.types.DataType` or str
+The type of the output records. The value can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
+outputMode : str
+The output mode of the stateful processor.
+timeMode : str
+The time mode semantics of the stateful processor for timers and 
TTL.
+
+Examples
+
+>>> import pandas as pd
+>>> from pyspark.sql import Row
+>>> from pyspark.sql.streaming import StatefulProcessor, 
StatefulProcessorHandle
+>>> from pyspark.sql.types import StructType, StructField, LongType, 
StringType, IntegerType
+>>> from typing import Iterator
+>>> from pyspark.sql.functions import split, col
+>>> 
spark.conf.set("spark.sql.streaming.stateStore.providerClass","org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
+>>> output_schema = StructType([
+... StructField("id", StringType(), True),
+... StructField("count", IntegerType(), True)
+... ])
+>>> state_schema = StructType([
+... StructField("value", IntegerType(), True)
+... ])
+>>> class SimpleStatefulProcessor(StatefulProcessor):
+...   def init(self, handle: StatefulProcessorHandle):

Review Comment:
   Use 4-space indentation.






Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-25 Thread via GitHub


HyukjinKwon commented on code in PR #47133:
URL: https://github.com/apache/spark/pull/47133#discussion_r1692326848


##
python/pyspark/sql/streaming/stateful_processor.py:
##
@@ -0,0 +1,180 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from abc import ABC, abstractmethod
+from typing import Any, TYPE_CHECKING, Iterator, Union
+
+from pyspark.sql.streaming.stateful_processor_api_client import 
StatefulProcessorApiClient
+from pyspark.sql.streaming.value_state_client import ValueStateClient
+
+import pandas as pd
+from pyspark.sql.types import (
+StructType,
+StructField,
+IntegerType,
+LongType,
+ShortType,
+FloatType,
+DoubleType,
+DecimalType,
+StringType,
+BooleanType,
+DateType,
+TimestampType,
+)
+
+__all__ = ["StatefulProcessor", "StatefulProcessorHandle"]
+
+if TYPE_CHECKING:
+from pyspark.sql.pandas._typing import DataFrameLike as PandasDataFrameLike
+
+
+class ValueState:
+"""
+Class used for arbitrary stateful operations with the v2 API to capture 
single value state.
+"""
+
+def __init__(
+self, value_state_client: ValueStateClient, state_name: str, schema: 
Union[StructType, str]
+) -> None:
+self._value_state_client = value_state_client
+self._state_name = state_name
+self.schema = schema
+
+def exists(self) -> bool:
+"""
+Whether state exists or not.
+
+.. versionadded:: 4.0.0

Review Comment:
   Adding it to the class-level docstring should be enough.






Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-25 Thread via GitHub


HyukjinKwon commented on code in PR #47133:
URL: https://github.com/apache/spark/pull/47133#discussion_r1692329891


##
python/pyspark/sql/pandas/group_ops.py:
##
@@ -358,6 +364,140 @@ def applyInPandasWithState(
 )
 return DataFrame(jdf, self.session)
 
+def transformWithStateInPandas(
+self,
+statefulProcessor: StatefulProcessor,
+outputStructType: Union[StructType, str],
+outputMode: str,
+timeMode: str,
+) -> DataFrame:
+"""
+Invokes methods defined in the stateful processor used in arbitrary 
state API v2.
+We allow the user to act on per-group set of input rows along with 
keyed state and the
+user can choose to output/return 0 or more rows.
+
+For a streaming dataframe, we will repeatedly invoke the interface 
methods for new rows
+in each trigger and the user's state/state variables will be stored 
persistently across
+invocations.
+
+The `statefulProcessor` should be a Python class that implements the 
interface defined in
+pyspark.sql.streaming.stateful_processor.StatefulProcessor.
+
+The `outputStructType` should be a :class:`StructType` describing the 
schema of all
+elements in the returned value, `pandas.DataFrame`. The column labels 
of all elements in
+returned `pandas.DataFrame` must either match the field names in the 
defined schema if
+specified as strings, or match the field data types by position if not 
strings,
+e.g. integer indices.
+
+The size of each `pandas.DataFrame` in both the input and output can 
be arbitrary. The
+number of `pandas.DataFrame` in both the input and output can also be 
arbitrary.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+statefulProcessor : 
:class:`pyspark.sql.streaming.stateful_processor.StatefulProcessor`
+Instance of StatefulProcessor whose functions will be invoked by 
the operator.
+outputStructType : :class:`pyspark.sql.types.DataType` or str
+The type of the output records. The value can be either a
+:class:`pyspark.sql.types.DataType` object or a DDL-formatted type 
string.
+outputMode : str
+The output mode of the stateful processor.
+timeMode : str
+The time mode semantics of the stateful processor for timers and 
TTL.
+
+Examples
+
+>>> import pandas as pd
+>>> from pyspark.sql import Row
+>>> from pyspark.sql.streaming import StatefulProcessor, 
StatefulProcessorHandle
+>>> from pyspark.sql.types import StructType, StructField, LongType, 
StringType, IntegerType
+>>> from typing import Iterator
+>>> from pyspark.sql.functions import split, col

Review Comment:
   Let's fix the imports according to the PEP 8 import order.
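   
   For example, the docstring's imports grouped per PEP 8 (standard library, 
   then third-party, then first-party, each group alphabetized):
   
   ```python
   # Standard library
   from typing import Iterator
   
   # Third-party
   import pandas as pd
   
   # First-party (pyspark)
   from pyspark.sql import Row
   from pyspark.sql.functions import col, split
   from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
   from pyspark.sql.types import (
       IntegerType,
       LongType,
       StringType,
       StructField,
       StructType,
   )
   ```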






Re: [PR] [SPARK-48829][BUILD] Upgrade `RoaringBitmap` to 1.2.1 [spark]

2024-07-25 Thread via GitHub


panbingkun commented on PR #47247:
URL: https://github.com/apache/spark/pull/47247#issuecomment-2251747973

   > ready to go?
   
   Yea, ready for review, thanks!





Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692345753


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -767,6 +771,7 @@ object FunctionRegistry {
 expression[EqualNull]("equal_null"),
 expression[HllSketchEstimate]("hll_sketch_estimate"),
 expression[HllUnion]("hll_union"),
+expression[UnwrapUDT]("unwrap_udt"),

Review Comment:
   Moved this out.






Re: [PR] [SPARK-48985][CONNECT] Connect Compatible Expression Constructors [spark]

2024-07-25 Thread via GitHub


hvanhovell commented on code in PR #47464:
URL: https://github.com/apache/spark/pull/47464#discussion_r1692345905


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -519,6 +519,8 @@ object FunctionRegistry {
 expressionBuilder("mode", ModeBuilder),
 expression[HllSketchAgg]("hll_sketch_agg"),
 expression[HllUnionAgg]("hll_union_agg"),
+expression[Product]("product"),
+expression[CollectTopK]("collect_top_k"),

Review Comment:
   Moved this out.






Re: [PR] [ONLY TEST][HOLD] Upgrade rocksdbjni to 9.4.0 [spark]

2024-07-25 Thread via GitHub


panbingkun commented on PR #47207:
URL: https://github.com/apache/spark/pull/47207#issuecomment-2251754892

   > Has there been any new progress on this one
   
   Let's wait a little longer; I think version 9.5 should be released soon.
   Screenshot: https://github.com/user-attachments/assets/c31f7463-8222-4842-9359-eb67ca4c9cec
   Thanks.





Re: [PR] [SPARK-48755] State V2 base implementation and ValueState support [spark]

2024-07-25 Thread via GitHub


HyukjinKwon commented on PR #47133:
URL: https://github.com/apache/spark/pull/47133#issuecomment-2251778289

   The test failure doesn't look related to me. Can you reproduce it locally?





Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]

2024-07-25 Thread via GitHub


anishshri-db commented on code in PR #47490:
URL: https://github.com/apache/spark/pull/47490#discussion_r1692363879


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##
@@ -79,6 +79,38 @@ class IncrementalExecution(
   StreamingTransformWithStateStrategy :: Nil
   }
 
+  // Methods to enable the use of AsyncLogPurge
+  protected val minLogEntriesToMaintain: Int =
+sparkSession.sessionState.conf.minBatchesToRetain
+
+  val errorNotifier: ErrorNotifier = new ErrorNotifier()
+
+  override protected def purge(threshold: Long): Unit = {}
+
+  override protected def purgeOldest(planWithStateOpId: SparkPlan): Unit = {
+planWithStateOpId.collect {
+  case ssw: StateStoreWriter =>
+ssw.operatorStateMetadataVersion match {
+  case 2 =>
+val fileManager = new OperatorStateMetadataV2FileManager(
+  new Path(checkpointLocation, 
ssw.getStateInfo.operatorId.toString),
+  ssw.stateSchemaFilePath(Some(StateStoreId.DEFAULT_STORE_NAME)),
+  hadoopConf)
+  fileManager.keepNEntries(minLogEntriesToMaintain)
+  case _ =>
+}
+  case _ =>
+}
+  }
+
+  private def purgeMetadataFiles(planWithStateOpId: SparkPlan): Unit = {
+if (useAsyncPurge) {

Review Comment:
   We can just always do this async, right? Is there a reason to also support sync?






Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]

2024-07-25 Thread via GitHub


anishshri-db commented on code in PR #47490:
URL: https://github.com/apache/spark/pull/47490#discussion_r1692364074


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadata.scala:
##
@@ -313,3 +314,89 @@ class OperatorStateMetadataV2Reader(
 OperatorStateMetadataUtils.readMetadata(inputStream)
   }
 }
+
+class OperatorStateMetadataV2FileManager(

Review Comment:
   Can we add a comment for this class? What are we using it for exactly?






Re: [PR] [SPARK-49006] Implement purging for OperatorStateMetadataV2 and StateSchemaV3 files [spark]

2024-07-25 Thread via GitHub


anishshri-db commented on code in PR #47490:
URL: https://github.com/apache/spark/pull/47490#discussion_r1692364600


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##
@@ -79,6 +79,38 @@ class IncrementalExecution(
   StreamingTransformWithStateStrategy :: Nil
   }
 
+  // Methods to enable the use of AsyncLogPurge
+  protected val minLogEntriesToMaintain: Int =
+sparkSession.sessionState.conf.minBatchesToRetain
+
+  val errorNotifier: ErrorNotifier = new ErrorNotifier()
+
+  override protected def purge(threshold: Long): Unit = {}
+
+  override protected def purgeOldest(planWithStateOpId: SparkPlan): Unit = {
+planWithStateOpId.collect {
+  case ssw: StateStoreWriter =>
+ssw.operatorStateMetadataVersion match {
+  case 2 =>
+val fileManager = new OperatorStateMetadataV2FileManager(
+  new Path(checkpointLocation, 
ssw.getStateInfo.operatorId.toString),
+  ssw.stateSchemaFilePath(Some(StateStoreId.DEFAULT_STORE_NAME)),
+  hadoopConf)
+  fileManager.keepNEntries(minLogEntriesToMaintain)

Review Comment:
   The logic here might not be very straightforward. Basically, we need to keep 
   the metadata for the earliest retained state.
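   
   One way to read that requirement, as a hedged sketch (hypothetical helper, 
   not the actual `OperatorStateMetadataV2FileManager` logic):
   
   ```scala
   // Keep the newest N metadata entries, but never delete an entry that the
   // earliest retained state version still depends on.
   def entriesToDelete(
       batchIds: Seq[Long],
       minEntriesToRetain: Int,
       earliestRetainedStateVersion: Long): Seq[Long] = {
     val sorted = batchIds.sorted
     val newestNCutoff =
       sorted.takeRight(minEntriesToRetain).headOption.getOrElse(Long.MinValue)
     // Deletable only if older than both the newest-N window and the earliest
     // retained state version.
     sorted.filter(id => id < newestNCutoff && id < earliestRetainedStateVersion)
   }
   ```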






Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]

2024-07-25 Thread via GitHub


cloud-fan commented on PR #47388:
URL: https://github.com/apache/spark/pull/47388#issuecomment-2251791506

   thanks, merging to master!





Re: [PR] [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [spark]

2024-07-25 Thread via GitHub


cloud-fan closed pull request #47388: [SPARK-48503][SQL] Allow grouping on 
expressions in scalar subqueries, if they are bound to outer rows
URL: https://github.com/apache/spark/pull/47388





Re: [PR] [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns [spark]

2024-07-25 Thread via GitHub


cloud-fan commented on PR #47451:
URL: https://github.com/apache/spark/pull/47451#issuecomment-2251797510

   thanks, merging to master!





Re: [PR] [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns [spark]

2024-07-25 Thread via GitHub


cloud-fan closed pull request #47451: [SPARK-45787][SQL] Support 
Catalog.listColumns for clustering columns
URL: https://github.com/apache/spark/pull/47451





Re: [PR] [SPARK-49005][K8S][3.5] Use `17-jammy` tag instead of `17` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47488:
URL: https://github.com/apache/spark/pull/47488#issuecomment-2251851725

   Late LGTM





Re: [PR] [SPARK-49005][K8S][3.4] Use `17-jammy` tag instead of `17-jre` to prevent Python 3.12 [spark]

2024-07-25 Thread via GitHub


yaooqinn commented on PR #47489:
URL: https://github.com/apache/spark/pull/47489#issuecomment-2251851929

   Late LGTM




