[jira] [Updated] (SPARK-51252) Adding state store level metrics for last uploaded snapshot version in HDFS State Stores

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51252:
---
Labels: pull-request-available  (was: )

> Adding state store level metrics for last uploaded snapshot version in HDFS 
> State Stores
> 
>
> Key: SPARK-51252
> URL: https://issues.apache.org/jira/browse/SPARK-51252
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Zeyu Chen
>Priority: Minor
>  Labels: pull-request-available
>
> Similar to SPARK-51097, we would like to introduce the same level of 
> observability to HDFSBackedStateStore.
> The introduction of state store "instance" metrics to StreamingQueryProgress 
> to track the latest snapshot version uploaded in HDFS state stores should 
> address three challenges in observability:
>  * Uneven partition starvation, where we need to identify partitions with 
> slow state maintenance,
>  * Finding missing snapshots across versions, so we minimize extensive 
> replays during recovery,
>  * Identifying performance instability, such as by gaining insight into 
> snapshot upload patterns.
> The instance metrics should be kept as general as possible, so that 
> future instance metrics for observability can be added with minimal 
> refactoring.
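
A minimal sketch (not from the ticket), assuming the new instance metrics surface through the existing customMetrics map on StateOperatorProgress; the metric-name substring below is a placeholder, not the final name:

{code:scala}
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.streaming.StreamingQuery

// Prints any snapshot-upload instance metrics found in the latest progress.
def logLastUploadedSnapshot(query: StreamingQuery): Unit = {
  val progress = query.lastProgress
  if (progress != null) {
    progress.stateOperators.foreach { op =>
      // customMetrics is a java.util.Map[String, java.lang.Long]
      op.customMetrics.asScala
        .filter { case (name, _) => name.contains("SnapshotLastUploaded") } // hypothetical name
        .foreach { case (name, version) =>
          println(s"batch ${progress.batchId}: $name -> $version")
        }
    }
  }
}
{code}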



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51274) PySparkLogger should respect the expected keyword arguments.

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51274:
---
Labels: pull-request-available  (was: )

> PySparkLogger should respect the expected keyword arguments.
> 
>
> Key: SPARK-51274
> URL: https://issues.apache.org/jira/browse/SPARK-51274
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51273) Spark Connect Call Procedure runs the procedure twice

2025-02-20 Thread Szehon Ho (Jira)
Szehon Ho created SPARK-51273:
-

 Summary: Spark Connect Call Procedure runs the procedure twice
 Key: SPARK-51273
 URL: https://issues.apache.org/jira/browse/SPARK-51273
 Project: Spark
  Issue Type: Bug
  Components: Connect, SQL
Affects Versions: 4.0.0
Reporter: Szehon Ho


Running 'CALL procedure' via Spark Connect results in the procedure getting 
called twice.

 

This is because org.apache.spark.sql.connect.SparkSession.sql sends the plan 
over to be evaluated, which invokes the procedure once.
 
This returns an org.apache.spark.sql.connect.Dataset, and then running 
df.collect() sends the plan to be evaluated again, invoking the procedure a 
second time.
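
A minimal sketch of the reported behavior, assuming the Spark Connect Scala client and a made-up catalog/procedure name:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes a Spark Connect server is running at this address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// sql() ships the CALL plan to the server, which runs the procedure once...
val df = spark.sql("CALL my_catalog.my_procedure()") // hypothetical procedure

// ...and collect() ships the plan again, running the procedure a second time.
df.collect()
{code}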



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51274) PySparkLogger should respect the expected keyword arguments.

2025-02-20 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-51274:
-

 Summary: PySparkLogger should respect the expected keyword 
arguments.
 Key: SPARK-51274
 URL: https://issues.apache.org/jira/browse/SPARK-51274
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51185) Revert simplifications to PartitionedFileUtil API due to increased risk of OOM

2025-02-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-51185:

Fix Version/s: 3.5.5

> Revert simplifications to PartitionedFileUtil API due to increased risk of OOM
> --
>
> Key: SPARK-51185
> URL: https://issues.apache.org/jira/browse/SPARK-51185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Lukas Rupprecht
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.5
>
>
> A prior ticket (see https://issues.apache.org/jira/browse/SPARK-44081) 
> implemented a simplification of the PartitionedFileUtil API that removed 
> redundant `path` arguments, which can be obtained directly from the 
> `FileStatusWithMetadata` object. To avoid repeatedly having to compute the 
> path, the change turned `FileStatusWithMetadata.getPath` from a def into a 
> lazy val. However, this increases memory requirements, as the path now 
> needs to be kept in memory throughout the lifetime of a 
> `FileStatusWithMetadata` object. This can lead to OOMs, so we should revert 
> the change and make `getPath` a `def` again.
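
A minimal sketch of the def vs. lazy val trade-off, using a simplified stand-in for the real class (field names are illustrative):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, Path}

// Simplified stand-in for Spark's FileStatusWithMetadata.
case class FileStatusWithMetadata(fileStatus: FileStatus) {
  // The reverted variant: a lazy val computes the Path once but then pins it
  // in memory for the object's whole lifetime, raising the OOM risk when
  // many instances are alive at once.
  // lazy val getPath: Path = fileStatus.getPath

  // The restored variant: a def recomputes the Path on demand and retains
  // nothing between calls.
  def getPath: Path = fileStatus.getPath
}
{code}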



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread Asif (Jira)
Asif created SPARK-51272:


 Summary: Race condition in DagScheduler can result in failure of 
retrying all partitions for non deterministic partitioning key
 Key: SPARK-51272
 URL: https://issues.apache.org/jira/browse/SPARK-51272
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.4
Reporter: Asif
 Fix For: 4.0.0


In the DAGScheduler, when a successful task completion occurs concurrently 
with a task failure in an indeterminate stage, only some partitions are 
retried instead of all partitions being re-executed. This results in data 
loss.
The race condition identified is as follows (a minimal sketch follows at the 
end of this description):
a) A successful result-stage task has yet to be marked as true in the boolean 
array tracking partition success/failure.
b) A concurrent failed result task, belonging to an indeterminate stage, 
identifies all the stages which need to/can be rolled back. For a 
ResultStage, it looks into the array of successful partitions. As none is 
marked as true, the ResultStage and its dependent stages are delegated to 
the thread pool for retry.
c) Between the time the stages to roll back are collected and the stages are 
retried, the successful task marks its boolean as true.
d) The retry of the stage, as a result, misses the partition that was marked 
as successful.

 

Attaching two files that reproduce the functional bug and show the race 
condition causing data corruption:
 # bugrepro.patch
This is needed to coax the single-VM test into reproducing the issue. It 
contains a lot of interception and tweaks to ensure the system hits the 
data-loss situation (e.g., each partition writes only a shuffle file 
containing keys that evaluate to the same hashCode, and the shuffle file is 
deleted at the right time).
 # The BugTest itself.

a) If bugrepro.patch is applied to current master and the BugTest is run, it 
fails immediately with an assertion failure: instead of 12 rows, 6 rows show 
up in the result.

b) If bugrepro.patch is applied on top of PR 
https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
more iterations, indicating the race condition in the DAGScheduler/Stage 
interaction.

c) But if the same BugTest is run on a branch containing the fix for this 
bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
all 100 iterations.
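
A minimal sketch of the check-then-act race (illustrative only; the interleaving is shown sequentially, and none of this is the actual DAGScheduler code):

{code:scala}
object RaceSketch {
  val numPartitions = 4
  // Shared state: which result partitions have completed successfully.
  val partitionFinished = new Array[Boolean](numPartitions)

  def main(args: Array[String]): Unit = {
    // "Thread A" (failure path of an indeterminate stage) snapshots the
    // array to decide which partitions to re-execute...
    val toRetry = partitionFinished.indices.filterNot(partitionFinished(_))

    // ..."thread B" (a concurrently completing task) marks its partition as
    // finished only after that snapshot was taken.
    partitionFinished(0) = true

    // The retry set was computed before the mark, so partition 0 is skipped
    // even though its output corresponds to the old, now-invalid data.
    println(s"retrying: $toRetry, finished: ${partitionFinished.mkString(",")}")
  }
}
{code}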




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-51272:
-
Attachment: BugTest.txt

> Race condition in DagScheduler can result in failure of retrying all 
> partitions for non deterministic partitioning key
> --
>
> Key: SPARK-51272
> URL: https://issues.apache.org/jira/browse/SPARK-51272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: BugTest.txt
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently 
> with a task failure in an indeterminate stage, only some partitions are 
> retried instead of all partitions being re-executed. This results in data 
> loss.
> The race condition identified is as follows:
> a) A successful result-stage task has yet to be marked as true in the 
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to/can be rolled back. For a 
> ResultStage, it looks into the array of successful partitions. As none is 
> marked as true, the ResultStage and its dependent stages are delegated to 
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the stages 
> are retried, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was 
> marked as successful.
>  
> Attaching two files that reproduce the functional bug and show the race 
> condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It 
> contains a lot of interception and tweaks to ensure the system hits the 
> data-loss situation (e.g., each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file 
> is deleted at the right time).
>  # The BugTest itself.
> a) If bugrepro.patch is applied to current master and the BugTest is run, 
> it fails immediately with an assertion failure: instead of 12 rows, 6 rows 
> show up in the result.
> b) If bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
> more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this 
> bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
> all 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928944#comment-17928944
 ] 

Asif commented on SPARK-51272:
--

 [^BugTest.txt]  [^bugrepro.patch] 

> Race condition in DagScheduler can result in failure of retrying all 
> partitions for non deterministic partitioning key
> --
>
> Key: SPARK-51272
> URL: https://issues.apache.org/jira/browse/SPARK-51272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: BugTest.txt, bugrepro.patch
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently 
> with a task failure in an indeterminate stage, only some partitions are 
> retried instead of all partitions being re-executed. This results in data 
> loss.
> The race condition identified is as follows:
> a) A successful result-stage task has yet to be marked as true in the 
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to/can be rolled back. For a 
> ResultStage, it looks into the array of successful partitions. As none is 
> marked as true, the ResultStage and its dependent stages are delegated to 
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the stages 
> are retried, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was 
> marked as successful.
>  
> Attaching two files that reproduce the functional bug and show the race 
> condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It 
> contains a lot of interception and tweaks to ensure the system hits the 
> data-loss situation (e.g., each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file 
> is deleted at the right time).
>  # The BugTest itself.
> a) If bugrepro.patch is applied to current master and the BugTest is run, 
> it fails immediately with an assertion failure: instead of 12 rows, 6 rows 
> show up in the result.
> b) If bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
> more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this 
> bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
> all 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-51272:
-
Labels: spark-core  (was: )

> Race condition in DagScheduler can result in failure of retrying all 
> partitions for non deterministic partitioning key
> --
>
> Key: SPARK-51272
> URL: https://issues.apache.org/jira/browse/SPARK-51272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
>  Labels: spark-core
> Fix For: 4.0.0
>
> Attachments: BugTest.txt, bugrepro.patch
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently 
> with a task failure in an indeterminate stage, only some partitions are 
> retried instead of all partitions being re-executed. This results in data 
> loss.
> The race condition identified is as follows:
> a) A successful result-stage task has yet to be marked as true in the 
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to/can be rolled back. For a 
> ResultStage, it looks into the array of successful partitions. As none is 
> marked as true, the ResultStage and its dependent stages are delegated to 
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the stages 
> are retried, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was 
> marked as successful.
>  
> Attaching two files that reproduce the functional bug and show the race 
> condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It 
> contains a lot of interception and tweaks to ensure the system hits the 
> data-loss situation (e.g., each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file 
> is deleted at the right time).
>  # The BugTest itself.
> a) If bugrepro.patch is applied to current master and the BugTest is run, 
> it fails immediately with an assertion failure: instead of 12 rows, 6 rows 
> show up in the result.
> b) If bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
> more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this 
> bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
> all 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-51272:
-
Attachment: bugrepro.patch

> Race condition in DagScheduler can result in failure of retrying all 
> partitions for non deterministic partitioning key
> --
>
> Key: SPARK-51272
> URL: https://issues.apache.org/jira/browse/SPARK-51272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: BugTest.txt, bugrepro.patch
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently 
> with a task failure in an indeterminate stage, only some partitions are 
> retried instead of all partitions being re-executed. This results in data 
> loss.
> The race condition identified is as follows:
> a) A successful result-stage task has yet to be marked as true in the 
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to/can be rolled back. For a 
> ResultStage, it looks into the array of successful partitions. As none is 
> marked as true, the ResultStage and its dependent stages are delegated to 
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the stages 
> are retried, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was 
> marked as successful.
>  
> Attaching two files that reproduce the functional bug and show the race 
> condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It 
> contains a lot of interception and tweaks to ensure the system hits the 
> data-loss situation (e.g., each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file 
> is deleted at the right time).
>  # The BugTest itself.
> a) If bugrepro.patch is applied to current master and the BugTest is run, 
> it fails immediately with an assertion failure: instead of 12 rows, 6 rows 
> show up in the result.
> b) If bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
> more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this 
> bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
> all 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51275) Session propagation in python readwrite

2025-02-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-51275:
-

Assignee: Ruifeng Zheng

> Session propagation in python readwrite
> ---
>
> Key: SPARK-51275
> URL: https://issues.apache.org/jira/browse/SPARK-51275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 4.1
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51262) exceptAll not working with drop_duplicates using subset

2025-02-20 Thread Nicolau Balbino (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolau Balbino updated SPARK-51262:

Description: 
When using drop_duplicates with a subset and afterwards using the exceptAll 
method, calling any action (isEmpty, show, collect, count) raises a Py4J 
error. Searching the web, this issue looks related to 
[https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612], 
which is marked as resolved.

I tested locally with version 3.5.3 and also on AWS Glue 5.0, which uses 3.5.

  was:
When using drop_duplicate with subset and after uses exceptAll method, when 
calling some action (isEmpty, show, collect, count) raises a Py4J error. 
Searching web, this issues is related here: 
[https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612,] also 
marked as resolved.

I tested locally with version 3.5.3 and also AWS Glue 5.0, using 3.5.


> exceptAll not working with drop_duplicates using subset
> ---
>
> Key: SPARK-51262
> URL: https://issues.apache.org/jira/browse/SPARK-51262
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.5.0, 3.5.3
>Reporter: Nicolau Balbino
>Priority: Minor
>  Labels: SQL
>
> When using drop_duplicates with a subset and afterwards using the exceptAll 
> method, calling any action (isEmpty, show, collect, count) raises a Py4J 
> error. Searching the web, this issue looks related to 
> [https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612], 
> which is marked as resolved.
> I tested locally with version 3.5.3 and also on AWS Glue 5.0, which uses 3.5.
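
A repro sketch in the Scala API with made-up data (the report is against PySpark, where the same underlying failure surfaces as a Py4J error); assumes a spark-shell style session:

{code:scala}
import spark.implicits._

val left  = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "v")
val right = Seq((2, "c")).toDF("id", "v")

// drop_duplicates with a subset, followed by exceptAll...
val diff = left.dropDuplicates(Seq("id")).exceptAll(right)

// ...reportedly fails once an action forces evaluation.
diff.count()
{code}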



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51273) Spark Connect Call Procedure runs the procedure twice

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51273:
---
Labels: pull-request-available  (was: )

> Spark Connect Call Procedure runs the procedure twice
> -
>
> Key: SPARK-51273
> URL: https://issues.apache.org/jira/browse/SPARK-51273
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>
> Running 'CALL procedure' via Spark Connect results in the procedure getting 
> called twice.
>  
> This is because org.apache.spark.sql.connect.SparkSession.sql sends the 
> plan over to be evaluated, which invokes the procedure once.
>  
> This returns an org.apache.spark.sql.connect.Dataset, and then running 
> df.collect() sends the plan to be evaluated again, invoking the procedure a 
> second time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-50864) Optimize and reenable slow pytorch tests

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50864:
---
Labels: pull-request-available  (was: )

> Optimize and reenable slow pytorch tests
> 
>
> Key: SPARK-50864
> URL: https://issues.apache.org/jira/browse/SPARK-50864
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark, Tests
>Affects Versions: 4.0.0, 4.1
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51278) Use appropriate structure of JSON format for PySparkLogger

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51278:
---
Labels: pull-request-available  (was: )

> Use appropriate structure of JSON format for PySparkLogger
> --
>
> Key: SPARK-51278
> URL: https://issues.apache.org/jira/browse/SPARK-51278
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.1.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51249) Fix version encoding bug in NoPrefixKeyStateEncoder

2025-02-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-51249.
--
Resolution: Fixed

Issue resolved by pull request 49996
[https://github.com/apache/spark/pull/49996]

> Fix version encoding bug in NoPrefixKeyStateEncoder
> ---
>
> Key: SPARK-51249
> URL: https://issues.apache.org/jira/browse/SPARK-51249
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, we add two version bytes in the NoPrefixKeyStateEncoder when 
> column families are enabled. However, this is unintended; we only want to 
> add one version byte.
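
Illustratively (not the actual encoder code), the difference amounts to:

{code:scala}
val versionByte: Byte = 0
val keyBytes = Array[Byte](1, 2, 3)

// Buggy: the version byte is prepended twice when column families are enabled.
val buggy = Array(versionByte, versionByte) ++ keyBytes

// Intended: exactly one version byte precedes the encoded key.
val intended = versionByte +: keyBytes
{code}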



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51249) Fix version encoding bug in NoPrefixKeyStateEncoder

2025-02-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-51249:


Assignee: Eric Marnadi

> Fix version encoding bug in NoPrefixKeyStateEncoder
> ---
>
> Key: SPARK-51249
> URL: https://issues.apache.org/jira/browse/SPARK-51249
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, we add two version bytes in the NoPrefixKeyStateEncoder when 
> column families are enabled. However, this is unintended; we only want to 
> add one version byte.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-50864) Optimize and reenable slow pytorch tests

2025-02-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-50864:
--
Affects Version/s: 4.1

> Optimize and reenable slow pytorch tests
> 
>
> Key: SPARK-50864
> URL: https://issues.apache.org/jira/browse/SPARK-50864
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark, Tests
>Affects Versions: 4.0.0, 4.1
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51278) Use appropriate structure of JSON format for PySparkLogger

2025-02-20 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-51278:
---

 Summary: Use appropriate structure of JSON format for PySparkLogger
 Key: SPARK-51278
 URL: https://issues.apache.org/jira/browse/SPARK-51278
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.1.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51272:
---
Labels: pull-request-available spark-core  (was: spark-core)

> Race condition in DagScheduler can result in failure of retrying all 
> partitions for non deterministic partitioning key
> --
>
> Key: SPARK-51272
> URL: https://issues.apache.org/jira/browse/SPARK-51272
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available, spark-core
> Fix For: 4.0.0
>
> Attachments: BugTest.txt, bugrepro.patch
>
>
> In the DAGScheduler, when a successful task completion occurs concurrently 
> with a task failure in an indeterminate stage, only some partitions are 
> retried instead of all partitions being re-executed. This results in data 
> loss.
> The race condition identified is as follows:
> a) A successful result-stage task has yet to be marked as true in the 
> boolean array tracking partition success/failure.
> b) A concurrent failed result task, belonging to an indeterminate stage, 
> identifies all the stages which need to/can be rolled back. For a 
> ResultStage, it looks into the array of successful partitions. As none is 
> marked as true, the ResultStage and its dependent stages are delegated to 
> the thread pool for retry.
> c) Between the time the stages to roll back are collected and the stages 
> are retried, the successful task marks its boolean as true.
> d) The retry of the stage, as a result, misses the partition that was 
> marked as successful.
>  
> Attaching two files that reproduce the functional bug and show the race 
> condition causing data corruption:
>  # bugrepro.patch
> This is needed to coax the single-VM test into reproducing the issue. It 
> contains a lot of interception and tweaks to ensure the system hits the 
> data-loss situation (e.g., each partition writes only a shuffle file 
> containing keys that evaluate to the same hashCode, and the shuffle file 
> is deleted at the right time).
>  # The BugTest itself.
> a) If bugrepro.patch is applied to current master and the BugTest is run, 
> it fails immediately with an assertion failure: instead of 12 rows, 6 rows 
> show up in the result.
> b) If bugrepro.patch is applied on top of PR 
> https://github.com/apache/spark/pull/50029, the BugTest fails after one or 
> more iterations, indicating the race condition in the DAGScheduler/Stage 
> interaction.
> c) But if the same BugTest is run on a branch containing the fix for this 
> bug as well as PR https://github.com/apache/spark/pull/50029, it passes in 
> all 100 iterations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default

2025-02-20 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-51276:


 Summary: Enable spark.sql.execution.arrow.pyspark.enabled by 
default
 Key: SPARK-51276
 URL: https://issues.apache.org/jira/browse/SPARK-51276
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


spark.sql.execution.arrow.pyspark.enabled has been there for a few years and 
is stable. We can enable it by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51276:
---
Labels: pull-request-available  (was: )

> Enable spark.sql.execution.arrow.pyspark.enabled by default
> ---
>
> Key: SPARK-51276
> URL: https://issues.apache.org/jira/browse/SPARK-51276
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> spark.sql.execution.arrow.pyspark.enabled has been there for a few years 
> and is stable. We can enable it by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default

2025-02-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-51276:
-
Labels: release-notes  (was: pull-request-available)

> Enable spark.sql.execution.arrow.pyspark.enabled by default
> ---
>
> Key: SPARK-51276
> URL: https://issues.apache.org/jira/browse/SPARK-51276
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: release-notes
>
> spark.sql.execution.arrow.pyspark.enabled has been there for a few years 
> and is stable. We can enable it by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48516) Turn on Arrow optimization for Python UDFs by default

2025-02-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48516:
-
Labels: pull-request-available release-notes  (was: pull-request-available)

> Turn on Arrow optimization for Python UDFs by default
> -
>
> Key: SPARK-48516
> URL: https://issues.apache.org/jira/browse/SPARK-48516
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available, release-notes
> Fix For: 4.0.0
>
>
> Turn on Arrow optimization for Python UDFs by default



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51277) Implement 0-arg implementation in Arrow-optimized Python UDF

2025-02-20 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-51277:


 Summary: Implement 0-arg implementation in Arrow-optimized Python 
UDF
 Key: SPARK-51277
 URL: https://issues.apache.org/jira/browse/SPARK-51277
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


The reason 0-arg pandas UDFs are not implemented is that we don't know how 
many rows to return, but in Arrow-optimized Python UDFs we do, so we can 
implement this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-51016) result data compromised in case of indeterministic join keys in Outer Join op, when retry happens

2025-02-20 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-51016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924245#comment-17924245
 ] 

Asif edited comment on SPARK-51016 at 2/20/25 11:20 PM:


Pull Request:

[https://github.com/apache/spark/pull/50029|https://github.com/apache/spark/pull/50029]


was (Author: ashahid7):
Pull Request:

[https://github.com/apache/spark/pull/49708|https://github.com/apache/spark/pull/49708]

> result data compromised in case of indeterministic join keys in Outer Join 
> op, when retry happens
> -
>
> Key: SPARK-51016
> URL: https://issues.apache.org/jira/browse/SPARK-51016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.4
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available, spark-sql
>
> For a query like:
> {quote}
> val outerDf = spark.createDataset(
>   Seq((1L, "aa"), (null, "aa"), (2L, "bb"), (null, "bb"), (3L, "cc"), (null, "cc")))(
>   Encoders.tuple(Encoders.LONG, Encoders.STRING)).toDF("pkLeftt", "strleft")
> 
> val innerDf = spark.createDataset(
>   Seq((1L, "11"), (2L, "22"), (3L, "33")))(
>   Encoders.tuple(Encoders.LONG, Encoders.STRING)).toDF("pkRight", "strright")
> 
> val leftOuter = outerDf.select(
>   col("strleft"),
>   when(isnull(col("pkLeftt")), floor(rand() * lit(1000L)).cast(LongType))
>     .otherwise(col("pkLeftt")).as("pkLeft"))
> 
> val outerjoin = leftOuter.hint("shuffle_hash")
>   .join(innerDf, col("pkLeft") === col("pkRight"), "left_outer")
> {quote}
>  
> where an arbitrary long value is assigned to the left table's join key when 
> it is null (so that skew is avoided during the shuffle), a retried 
> partition task can produce wrong results, such as data loss.
> The reason is that in such cases, if even one partition task fails, the 
> whole shuffle stage needs to be re-attempted; otherwise, retrying the 
> single failed task can assign some rows to a partition whose task has 
> already finished.
> Spark's DAGScheduler and TaskSchedulerImpl already anticipate such 
> situations and rely on the boolean stage.isIndeterminate to retry the 
> whole stage.
> This boolean is evaluated by consulting the RDD's dependency graph to find 
> whether any dependency is indeterminate.
> The bug is that the ShuffleDependency code does not consult the hashing 
> expression to see whether any of its components represents an indeterminate 
> value.
> Also, there is currently no way to identify whether an attribute reference 
> or an expression's value has an indeterminate component.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51275) Session propagation in python readwrite

2025-02-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-51275:
-

 Summary: Session propagation in python readwrite
 Key: SPARK-51275
 URL: https://issues.apache.org/jira/browse/SPARK-51275
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 4.1
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51275) Session propagation in python readwrite

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51275:
---
Labels: pull-request-available  (was: )

> Session propagation in python readwrite
> ---
>
> Key: SPARK-51275
> URL: https://issues.apache.org/jira/browse/SPARK-51275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 4.1
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51267) Match local Spark Connect server logic between Python and Scala

2025-02-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-51267.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 50017
[https://github.com/apache/spark/pull/50017]

> Match local Spark Connect server logic between Python and Scala
> ---
>
> Key: SPARK-51267
> URL: https://issues.apache.org/jira/browse/SPARK-51267
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> There are some differences between them; we should match the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51267) Match local Spark Connect server logic between Python and Scala

2025-02-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-51267:


Assignee: Hyukjin Kwon

> Match local Spark Connect server logic between Python and Scala
> ---
>
> Key: SPARK-51267
> URL: https://issues.apache.org/jira/browse/SPARK-51267
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> There are some differences between them; we should match the logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48530) [M1] Support for local variables

2025-02-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48530.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 49445
[https://github.com/apache/spark/pull/49445]

> [M1] Support for local variables
> 
>
> Key: SPARK-48530
> URL: https://issues.apache.org/jira/browse/SPARK-48530
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> At the moment, variables in SQL scripts create session variables. We 
> don't want this; we want variables to be treated as local (within the 
> block/compound).
>  
> -To achieve this, we probably need to wait for labels support. Once we have 
> it, we can prepend variable names with labels to distinguish between 
> variables with the same name, and only then reuse the session-variables 
> mechanism to save values under such composed names.-
> -If the block/compound doesn't have a label, we should generate it 
> automatically (a GUID or something similar).-
>  
> Labels support is done, so we can use labels now for local variables - they 
> need to be able to be referenced by ..
> We cannot reuse session variables for this - we need to implement local 
> variables per script, with scoping and shadowing.
>  
> Session variables need to be able to be accessed by 
> `system.session.` at all times from the script, though.
>  
> -Also, variables cannot recover from script failures - if a script fails, 
> variables won’t be dropped. If we intend to reuse session variables for 
> local variables, we should think about how to fix this.- We won't be using 
> session variables, so this shouldn't be a problem, but we need to keep it 
> in mind because a wrong implementation can cause the same problem.
>  
> Use the [Session Variables PR|https://github.com/apache/spark/pull/40474/files] 
> as a reference for the places where variable changes need to happen.
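
A sketch of the desired scoping, assuming Spark's SQL scripting syntax (BEGIN/DECLARE/END) is enabled; the variable names are made up:

{code:scala}
// Inside a compound statement, DECLARE should create a variable local to the
// block, with shadowing across nested blocks, rather than a session variable.
spark.sql("""
  BEGIN
    DECLARE x INT DEFAULT 1;      -- should be local to this compound
    BEGIN
      DECLARE x INT DEFAULT 2;    -- shadows the outer x
      SELECT x;                   -- 2
    END;
    SELECT x;                     -- 1
  END
""")
{code}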



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48530) [M1] Support for local variables

2025-02-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48530:
---

Assignee: David Milicevic

> [M1] Support for local variables
> 
>
> Key: SPARK-48530
> URL: https://issues.apache.org/jira/browse/SPARK-48530
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> At the moment, variables in SQL scripts create session variables. We 
> don't want this; we want variables to be treated as local (within the 
> block/compound).
>  
> -To achieve this, we probably need to wait for labels support. Once we have 
> it, we can prepend variable names with labels to distinguish between 
> variables with the same name, and only then reuse the session-variables 
> mechanism to save values under such composed names.-
> -If the block/compound doesn't have a label, we should generate it 
> automatically (a GUID or something similar).-
>  
> Labels support is done, so we can use labels now for local variables - they 
> need to be able to be referenced by ..
> We cannot reuse session variables for this - we need to implement local 
> variables per script, with scoping and shadowing.
>  
> Session variables need to be able to be accessed by 
> `system.session.` at all times from the script, though.
>  
> -Also, variables cannot recover from script failures - if a script fails, 
> variables won’t be dropped. If we intend to reuse session variables for 
> local variables, we should think about how to fix this.- We won't be using 
> session variables, so this shouldn't be a problem, but we need to keep it 
> in mind because a wrong implementation can cause the same problem.
>  
> Use the [Session Variables PR|https://github.com/apache/spark/pull/40474/files] 
> as a reference for the places where variable changes need to happen.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for avro compression level

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51269:
---
Labels: pull-request-available  (was: )

> SQLConf should manage the default value for avro compression level
> --
>
> Key: SPARK-51269
> URL: https://issues.apache.org/jira/browse/SPARK-51269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the default value of spark.sql.avro.deflate.level is -1, but it 
> is managed via the enum AvroCompressionCodec.
> If developers use the config item in any other code path, there is no 
> guarantee about the default value.
> On the other hand, users will be confused if there is no default value in 
> SQLConf.
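
A sketch of the proposed shape using Spark's internal config builder (illustrative only; the real entry lives in SQLConf, where buildConf is in scope):

{code:scala}
import java.util.zip.Deflater // Deflater.DEFAULT_COMPRESSION == -1

val AVRO_DEFLATE_LEVEL = buildConf("spark.sql.avro.deflate.level")
  .doc("Compression level for the deflate codec used in writing Avro files. " +
    "Valid values are 1 to 9 inclusive, or -1 for the codec default.")
  .version("2.4.0")
  .intConf
  .checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
  // Today the default comes from the AvroCompressionCodec enum; defining it
  // here would make it visible to every code path that reads the conf.
  .createWithDefault(Deflater.DEFAULT_COMPRESSION)
{code}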



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51269) Let the SQLConf control the default value for compression level

2025-02-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-51269:
---
Description: 
Currently, the default value of spark.sql.avro.deflate.level is -1, but it is 
managed via the enum AvroCompressionCodec.
If developers use the config item in any other code path, there is no 
guarantee about the default value.
On the other hand, users will be confused if there is no default value in 
SQLConf.

> Let the SQLConf control the default value for compression level
> ---
>
> Key: SPARK-51269
> URL: https://issues.apache.org/jira/browse/SPARK-51269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, the default value of spark.sql.avro.deflate.level is -1, but it 
> is managed via the enum AvroCompressionCodec.
> If developers use the config item in any other code path, there is no 
> guarantee about the default value.
> On the other hand, users will be confused if there is no default value in 
> SQLConf.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for compression level

2025-02-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-51269:
---
Summary: SQLConf should manage the default value for compression level  
(was: Let the SQLConf control the default value for compression level)

> SQLConf should manage the default value for compression level
> -
>
> Key: SPARK-51269
> URL: https://issues.apache.org/jira/browse/SPARK-51269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, the default value of spark.sql.avro.deflate.level is -1, but it 
> is managed via the enum AvroCompressionCodec.
> If developers use the config item in any other code path, there is no 
> guarantee about the default value.
> On the other hand, users will be confused if there is no default value in 
> SQLConf.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51263) Clean up unnecessary `invokePrivate` method calls in the test code.

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51263:
---
Labels: pull-request-available  (was: )

> Clean up unnecessary `invokePrivate` method calls in the test code.
> ---
>
> Key: SPARK-51263
> URL: https://issues.apache.org/jira/browse/SPARK-51263
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 4.1.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-49489) 'TTransportException: MaxMessageSize reached' occurs when getting a table with a large number of partitions

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-49489:
---
Labels: pull-request-available  (was: )

> 'TTransportException: MaxMessageSize reached' occurs when getting a table 
> with a large number of partitions 
> --
>
> Key: SPARK-49489
> URL: https://issues.apache.org/jira/browse/SPARK-49489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Internal fix.png
>
>
> {noformat}
> org.apache.thrift.transport.TTransportException: MaxMessageSize reached
> at 
> org.apache.thrift.transport.TEndpointTransport.countConsumedMessageBytes(TEndpointTransport.java:96)
>  
> at 
> org.apache.thrift.transport.TMemoryInputTransport.read(TMemoryInputTransport.java:97)
>  
> at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:390) 
> at 
> org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:39)
>  
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) 
> at 
> org.apache.hadoop.hive.metastore.security.TFilterTransport.readAll(TFilterTransport.java:63)
>  
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:464) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:329) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:273)
>  
> at 
> org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:461)
>  
> at 
> org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:454)
>  
> at 
> org.apache.hadoop.hive.metastore.api.FieldSchema.read(FieldSchema.java:388) 
> at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1269)
>  
> at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1248)
>  
> at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor.read(StorageDescriptor.java:1110)
>  
> at 
> org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:1270)
>  
> at 
> org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:1205)
>  
> at org.apache.hadoop.hive.metastore.api.Partition.read(Partition.java:1062) 
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51270) Support new Variant types UUID, Time, and nanosecond timestamp

2025-02-20 Thread David Cashman (Jira)
David Cashman created SPARK-51270:
-

 Summary: Support new Variant types UUID, Time, and nanosecond 
timestamp
 Key: SPARK-51270
 URL: https://issues.apache.org/jira/browse/SPARK-51270
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.1
Reporter: David Cashman


Four new types were added to the Parquet Variant spec 
([https://github.com/apache/parquet-format/commit/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2]): 
UUID, Time, Timestamp(NANOS), and TimestampNTZ(NANOS).

Spark does not have corresponding types, but we should add support for basic 
Variant operations: extraction, cast to JSON/string, and reporting the type in 
SchemaOfVariant.
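
For context, a short Scala sketch of the operations listed above, using the
existing Variant functions (parse_json, variant_get, and schema_of_variant
already exist in Spark SQL); the commented lines describing UUID, Time, and
nanosecond behavior are hypothetical until the new types are supported:

{noformat}
// Existing Variant operations:
spark.sql("""SELECT variant_get(parse_json('{"a": 1}'), '$.a', 'int')""").show()
spark.sql("""SELECT schema_of_variant(parse_json('{"a": 1}'))""").show()

// Once the new Parquet Variant types can be read, the same paths would
// need to handle them (hypothetical sketch; exact rendering TBD):
//   variant_get(v, '$.id', 'string')   -- extraction: UUID rendered as text
//   cast(v AS STRING)                  -- cast to JSON with nanos precision
//   schema_of_variant(v)               -- reports the new type names
{noformat}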



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51270) Support new Variant types UUID, Time, and nanosecond timestamp

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51270:
---
Labels: pull-request-available  (was: )

> Support new Variant types UUID, Time, and nanosecond timestamp
> --
>
> Key: SPARK-51270
> URL: https://issues.apache.org/jira/browse/SPARK-51270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.1
>Reporter: David Cashman
>Priority: Major
>  Labels: pull-request-available
>
> Four new types were added to the Parquet Variant spec 
> ([https://github.com/apache/parquet-format/commit/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2]): 
> UUID, Time, Timestamp(NANOS), and TimestampNTZ(NANOS).
> Spark does not have corresponding types, but we should add support for basic 
> Variant operations: extraction, cast to JSON/string, and reporting the type 
> in SchemaOfVariant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`

2025-02-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-51266:
-

Assignee: Yang Jie

> Remove the no longer used definition of `private[spark] object 
> TaskDetailsClassNames`
> -
>
> Key: SPARK-51266
> URL: https://issues.apache.org/jira/browse/SPARK-51266
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`

2025-02-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-51266.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50016
[https://github.com/apache/spark/pull/50016]

> Remove the no longer used definition of `private[spark] object 
> TaskDetailsClassNames`
> -
>
> Key: SPARK-51266
> URL: https://issues.apache.org/jira/browse/SPARK-51266
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`

2025-02-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-51266:
--
Parent: SPARK-51166
Issue Type: Sub-task  (was: Improvement)

> Remove the no longer used definition of `private[spark] object 
> TaskDetailsClassNames`
> -
>
> Key: SPARK-51266
> URL: https://issues.apache.org/jira/browse/SPARK-51266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-50785) [M1] FOR statement improvements

2025-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50785:
---
Labels: pull-request-available  (was: )

> [M1] FOR statement improvements
> ---
>
> Key: SPARK-50785
> URL: https://issues.apache.org/jira/browse/SPARK-50785
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dusan Tisma
>Priority: Major
>  Labels: pull-request-available
>
> The FOR statement should be updated to use the proper implementation of 
> local variables once that work is checked in.
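
For reference, a minimal Scala sketch of a SQL-scripting FOR statement whose
loop variable should follow local-variable semantics (assumes SQL scripting is
enabled, e.g. via spark.sql.scripting.enabled; syntax follows the Spark 4.0
scripting work and may still shift):

{noformat}
// The loop variable r is scoped to the loop body, i.e. it should behave
// like a properly implemented local variable.
spark.sql("""
  BEGIN
    FOR r AS SELECT id FROM range(3) DO
      SELECT r.id;
    END FOR;
  END
""").show()
{noformat}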



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51271) Python Data Sources Filter Pushdown API

2025-02-20 Thread Haoyu Weng (Jira)
Haoyu Weng created SPARK-51271:
--

 Summary: Python Data Sources Filter Pushdown API
 Key: SPARK-51271
 URL: https://issues.apache.org/jira/browse/SPARK-51271
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.1.0
Reporter: Haoyu Weng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51097) Adding state store level metrics for last uploaded snapshot version in RocksDB

2025-02-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-51097:


Assignee: Zeyu Chen

> Adding state store level metrics for last uploaded snapshot version in RocksDB
> --
>
> Key: SPARK-51097
> URL: https://issues.apache.org/jira/browse/SPARK-51097
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Zeyu Chen
>Assignee: Zeyu Chen
>Priority: Minor
>  Labels: pull-request-available
>
> We currently lack detailed visibility into state-store-level state 
> maintenance in RocksDB. This limitation makes it difficult to identify 
> performance degradation caused by maintenance tasks. 
> To remediate this, we will introduce state store "instance" metrics to 
> StreamingQueryProgress to track the latest snapshot version uploaded in 
> RocksDB.
> This improvement addresses three challenges in observability:
>  * Identifying uneven partition starvation, where we need to find 
> partitions with slow state maintenance,
>  * Finding missing snapshots across versions, so we can minimize extensive 
> replays during recovery,
>  * Identifying performance instability, such as gaining insight into 
> snapshot upload patterns.
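
A Scala sketch of how such instance metrics could be consumed once they
surface in StateOperatorProgress.customMetrics; the listener API below is
real, but the "SnapshotLastUploaded" key prefix is a hypothetical placeholder
for whatever names the change actually emits:

{noformat}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import scala.jdk.CollectionConverters._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    // Scan per-operator custom metrics for per-partition snapshot versions.
    e.progress.stateOperators.foreach { op =>
      op.customMetrics.asScala
        .filter { case (k, _) => k.startsWith("SnapshotLastUploaded") }
        .foreach { case (k, v) => println(s"$k -> last uploaded version $v") }
    }
  }
})
{noformat}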



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51097) Adding state store level metrics for last uploaded snapshot version in RocksDB

2025-02-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-51097.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 49816
[https://github.com/apache/spark/pull/49816]

> Adding state store level metrics for last uploaded snapshot version in RocksDB
> --
>
> Key: SPARK-51097
> URL: https://issues.apache.org/jira/browse/SPARK-51097
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Zeyu Chen
>Assignee: Zeyu Chen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We currently lack detailed visibility into state-store-level state 
> maintenance in RocksDB. This limitation makes it difficult to identify 
> performance degradation caused by maintenance tasks. 
> To remediate this, we will introduce state store "instance" metrics to 
> StreamingQueryProgress to track the latest snapshot version uploaded in 
> RocksDB.
> This improvement addresses three challenges in observability:
>  * Identifying uneven partition starvation, where we need to find 
> partitions with slow state maintenance,
>  * Finding missing snapshots across versions, so we can minimize extensive 
> replays during recovery,
>  * Identifying performance instability, such as gaining insight into 
> snapshot upload patterns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for avro compression level

2025-02-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-51269:
---
Summary: SQLConf should manage the default value for avro compression level 
 (was: SQLConf should manage the default value for compression level)

> SQLConf should manage the default value for avro compression level
> --
>
> Key: SPARK-51269
> URL: https://issues.apache.org/jira/browse/SPARK-51269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, the default value of spark.sql.avro.deflate.level is -1, but it 
> is managed through the enum AvroCompressionCodec rather than SQLConf.
> If developers use the config item in any other code path, there is no 
> guarantee of a consistent default value.
> Users may also be confused if SQLConf does not define a default value.
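
As an illustration, a minimal Scala sketch of what owning the default in
SQLConf could look like, modeled on SQLConf's buildConf DSL (this would live
inside SQLConf, where buildConf is defined; it is illustrative only, not the
actual patch):

{noformat}
import java.util.zip.Deflater  // Deflater.DEFAULT_COMPRESSION == -1

// A config entry that owns the -1 default, so every code path reading
// spark.sql.avro.deflate.level sees the same value.
val AVRO_DEFLATE_LEVEL = buildConf("spark.sql.avro.deflate.level")
  .doc("Compression level for the deflate codec used to write Avro files. " +
    "Valid values are 1 through 9 inclusive, or -1 for the codec default.")
  .intConf
  .checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
  .createWithDefault(Deflater.DEFAULT_COMPRESSION)
{noformat}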



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms

2025-02-20 Thread Jonathan Albrecht (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Albrecht updated SPARK-51092:
--
Summary: Skip the v1 FlatMapGroupsWithState tests with timeout on big 
endian platforms  (was: FlatMapGroupsWithState: Change the dataType of the 
timestampTimeoutAttribute from IntegerType to LongType)

> Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
> -
>
> Key: SPARK-51092
> URL: https://issues.apache.org/jira/browse/SPARK-51092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0
>Reporter: Jonathan Albrecht
>Priority: Minor
>  Labels: big-endian, pull-request-available
>
> The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in 
> FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is 
> serialized and stored in memory as a Long everywhere.
> This causes the {{timestampTimeout}} value to be corrupted on big endian 
> platforms. This is seen in unit tests relating to the v1 state manager when 
> run on big endian platforms.
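
For intuition, a standalone Scala sketch (plain java.nio, not Spark's actual
row format) of why a 4-byte read of an 8-byte slot can look correct on
little-endian hosts and come back corrupted on big-endian ones:

{noformat}
import java.nio.{ByteBuffer, ByteOrder}

// Write a Long into an 8-byte slot, then read it back as a 4-byte Int,
// as a declared IntegerType over a Long-serialized field effectively does.
def readIntAt(order: ByteOrder): Int = {
  val buf = ByteBuffer.allocate(8).order(order)
  buf.putLong(0, 123456789L)  // any timeout value that fits in 32 bits
  buf.getInt(0)               // reads only the first 4 bytes of the slot
}

readIntAt(ByteOrder.LITTLE_ENDIAN)  // 123456789: low half first, looks fine
readIntAt(ByteOrder.BIG_ENDIAN)     // 0: high half first, value corrupted
{noformat}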



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51269) Let the SQLConf control the default value for compression level

2025-02-20 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-51269:
--

 Summary: Let the SQLConf control the default value for compression 
level
 Key: SPARK-51269
 URL: https://issues.apache.org/jira/browse/SPARK-51269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms

2025-02-20 Thread Jonathan Albrecht (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Albrecht updated SPARK-51092:
--
Description: 
The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in 
FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is 
serialized and stored in memory as a Long everywhere.

This causes the {{timestampTimeout}} value to be corrupted on big endian 
platforms. This is seen in unit tests relating to the v1 state manager when run 
on big endian platforms.

This can't be fixed without a breaking schema change, so we skip the tests 
instead.

  was:
The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in 
FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is 
serialized and stored in memory as a Long everywhere.

This causes the {{timestampTimeout}} value to be corrupted on big endian 
platforms. This is seen in unit tests relating to the v1 state manager when run 
on big endian platforms.


> Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
> -
>
> Key: SPARK-51092
> URL: https://issues.apache.org/jira/browse/SPARK-51092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0
>Reporter: Jonathan Albrecht
>Priority: Minor
>  Labels: big-endian, pull-request-available
>
> The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in 
> FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is 
> serialized and stored in memory as a Long everywhere.
> This causes the {{timestampTimeout}} value to be corrupted on big endian 
> platforms. This is seen in unit tests relating to the v1 state manager when 
> run on big endian platforms.
> This can't be fixed without a breaking schema change, so we skip the tests 
> instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms

2025-02-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-51092.
--
Fix Version/s: 4.0.0
 Assignee: Jonathan Albrecht
   Resolution: Fixed

Issue resolved via https://github.com/apache/spark/pull/49811

> Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
> -
>
> Key: SPARK-51092
> URL: https://issues.apache.org/jira/browse/SPARK-51092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0
>Reporter: Jonathan Albrecht
>Assignee: Jonathan Albrecht
>Priority: Minor
>  Labels: big-endian, pull-request-available
> Fix For: 4.0.0
>
>
> The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in 
> FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is 
> serialized and stored in memory as a Long everywhere.
> This causes the {{timestampTimeout}} value to be corrupted on big endian 
> platforms. This is seen in unit tests relating to the v1 state manager when 
> run on big endian platforms.
> This can't be fixed without a breaking schema change, so we skip the tests 
> instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org