[jira] [Updated] (SPARK-51252) Adding state store level metrics for last uploaded snapshot version in HDFS State Stores
[ https://issues.apache.org/jira/browse/SPARK-51252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51252: Labels: pull-request-available (was: )
Key: SPARK-51252 / URL: https://issues.apache.org/jira/browse/SPARK-51252 / Project: Spark / Issue Type: Improvement / Components: Structured Streaming / Affects Versions: 4.0.0, 4.1.0 / Reporter: Zeyu Chen / Priority: Minor / Labels: pull-request-available
Similar to SPARK-51097, we would like to introduce the same level of observability for HDFSBackedStateStore. Introducing state store "instance" metrics to StreamingQueryProgress to track the latest snapshot version uploaded in HDFS state stores should address three observability challenges:
* Uneven partition starvation, where we need to identify partitions with slow state maintenance,
* Finding missing snapshots across versions, so we minimize extensive replays during recovery,
* Identifying performance instability, such as gaining insights into snapshot upload patterns.
The instance metrics should be kept as generalized as possible, so that future instance metrics for observability can be added with minimal refactoring.
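To make the proposed metrics concrete, here is a minimal sketch of how they could be read from a running query. query.lastProgress is the existing PySpark API; the assumption here is that the instance metrics surface under stateOperators[*].customMetrics, and the key format shown is illustrative only:
{code:python}
from pyspark.sql.streaming import StreamingQuery

def last_uploaded_snapshot_versions(query: StreamingQuery) -> dict:
    """Collect per-partition snapshot-upload instance metrics, if present."""
    progress = query.lastProgress  # dict, or None before the first progress event
    if progress is None:
        return {}
    versions = {}
    for op in progress.get("stateOperators", []):
        for name, value in op.get("customMetrics", {}).items():
            # Assumed key shape: "SnapshotLastUploaded.partition_<N>_<storeName>"
            if name.startswith("SnapshotLastUploaded"):
                versions[name] = value
    return versions
{code}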
[jira] [Updated] (SPARK-51274) PySparkLogger should respect the expected keyword arguments.
[ https://issues.apache.org/jira/browse/SPARK-51274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51274: Labels: pull-request-available (was: )
Key: SPARK-51274 / URL: https://issues.apache.org/jira/browse/SPARK-51274 / Project: Spark / Issue Type: Bug / Components: PySpark / Affects Versions: 4.0.0 / Reporter: Takuya Ueshin / Priority: Major / Labels: pull-request-available
[jira] [Created] (SPARK-51273) Spark Connect Call Procedure runs the procedure twice
Szehon Ho created SPARK-51273: Summary: Spark Connect Call Procedure runs the procedure twice / Key: SPARK-51273 / URL: https://issues.apache.org/jira/browse/SPARK-51273 / Project: Spark / Issue Type: Bug / Components: Connect, SQL / Affects Versions: 4.0.0 / Reporter: Szehon Ho
Running 'CALL procedure' via Spark Connect results in the procedure getting called twice. This is because org.apache.spark.sql.connect.SparkSession.sql sends the plan over to be evaluated, which invokes the procedure once. It returns an org.apache.spark.sql.connect.Dataset, and then running df.collect() sends the plan to be evaluated again, invoking the procedure a second time.
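A minimal repro sketch of the reported behavior, using the Spark Connect Python client (the ticket describes the Scala client, but the call pattern is analogous); the remote address and procedure name below are placeholders:
{code:python}
from pyspark.sql import SparkSession

# Placeholder remote address and procedure name.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.sql("CALL my_catalog.system.my_proc()")  # plan sent for evaluation: first invocation
rows = df.collect()                                 # plan sent again: second invocation
{code}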
[jira] [Created] (SPARK-51274) PySparkLogger should respect the expected keyword arguments.
Takuya Ueshin created SPARK-51274: Summary: PySparkLogger should respect the expected keyword arguments. / Key: SPARK-51274 / URL: https://issues.apache.org/jira/browse/SPARK-51274 / Project: Spark / Issue Type: Bug / Components: PySpark / Affects Versions: 4.0.0 / Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-51185) Revert simplifications to PartitionedFileUtil API due to increased risk of OOM
[ https://issues.apache.org/jira/browse/SPARK-51185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-51185: Fix Version/s: 3.5.5
Key: SPARK-51185 / URL: https://issues.apache.org/jira/browse/SPARK-51185 / Project: Spark / Issue Type: Sub-task / Components: Spark Core / Affects Versions: 4.0.0 / Reporter: Lukas Rupprecht / Priority: Major / Labels: pull-request-available / Fix For: 4.0.0, 3.5.5
A prior ticket (see https://issues.apache.org/jira/browse/SPARK-44081) simplified the PartitionedFileUtil API by removing redundant `path` arguments, which can be obtained directly from the `FileStatusWithMetadata` object. To avoid repeatedly computing the path, that change also turned `FileStatusWithMetadata.getPath` from a def into a lazy val. However, this increases memory requirements, as the path now has to be kept in memory for the lifetime of a `FileStatusWithMetadata` object. This can lead to OOMs, so we should revert this change and make `getPath` a `def` again.
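The def-versus-lazy-val tradeoff described above can be illustrated in Python (a sketch, not Spark code): a cached property keeps its computed value alive for the lifetime of the owning object, which is exactly the retention the revert avoids, while a plain property recomputes on demand and retains nothing:
{code:python}
from functools import cached_property

def path_from(raw: str) -> str:
    # Stand-in for the path computation on FileStatusWithMetadata.
    return raw.upper()

class FileStatusLazy:
    """Like a Scala lazy val: computed once, then retained in memory."""
    def __init__(self, raw: str):
        self.raw = raw

    @cached_property
    def path(self) -> str:
        return path_from(self.raw)

class FileStatusDef:
    """Like a Scala def: recomputed per call, nothing extra retained."""
    def __init__(self, raw: str):
        self.raw = raw

    @property
    def path(self) -> str:
        return path_from(self.raw)
{code}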
[jira] [Created] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
Asif created SPARK-51272: Summary: Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key / Key: SPARK-51272 / URL: https://issues.apache.org/jira/browse/SPARK-51272 / Project: Spark / Issue Type: Bug / Components: Spark Core / Affects Versions: 3.5.4 / Reporter: Asif / Fix For: 4.0.0
In the DAGScheduler, a successful task completion that occurs concurrently with a task failure in an indeterminate stage can result in only some partitions being retried, instead of all partitions being re-executed. This results in data loss. The race condition identified is as follows:
a) A successful result-stage task has yet to be marked as true in the boolean array tracking partition success/failure.
b) A concurrent failed result task, belonging to an indeterminate stage, identifies all the stages which need/can be rolled back. For a result stage, it looks into the array of successful partitions. As none is marked true, the ResultStage and its dependent stages are delegated to the thread pool for retry.
c) Between the time the stages to roll back are collected and the stages are retried, the successful task marks its boolean as true.
d) The retry of the stage, as a result, misses the partition marked as successful.
Attaching two files that reproduce the functional bug and show the race condition causing data corruption:
1. bugrepro.patch: needed to coax the single-VM test to reproduce the issue. It has lots of interception and tweaks to ensure the system can hit the data-loss situation (for example, each partition writes only a shuffle file containing keys evaluating to the same hashCode, and the shuffle file is deleted at the right time).
2. The BugTest itself:
a) If bugrepro.patch is applied to current master and the BugTest is run, it fails immediately with an assertion failure where 6 rows show up in the result instead of 12.
b) If bugrepro.patch is applied on top of PR https://github.com/apache/spark/pull/50029, the BugTest fails after one or more iterations, indicating the race condition in the DAGScheduler/Stage interaction.
c) But if the same BugTest is run on a branch containing the fix for this bug as well as PR https://github.com/apache/spark/pull/50029, it passes in all 100 iterations.
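The interleaving in steps (a)-(d) can be reconstructed sequentially (a toy Python sketch of the described ordering, not DAGScheduler code):
{code:python}
# (b) The failure handler snapshots state before the concurrent success lands:
succeeded = [False, False]                 # per-partition success flags
decided_to_retry_all = not any(succeeded)  # True: roll back the whole stage

# (c) The concurrently finishing task now records its success:
succeeded[0] = True

# (d) The resubmission derives the retry set from the *current* flags, so
#     partition 0, whose output is based on pre-retry data, is skipped:
retried = [p for p, ok in enumerate(succeeded) if not ok]
print(decided_to_retry_all, retried)       # True [1] -> partition 0 missed
{code}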
[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-51272: Attachment: BugTest.txt
[jira] [Commented] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928944#comment-17928944 ] Asif commented on SPARK-51272: [^BugTest.txt] [^bugrepro.patch]
[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-51272: Labels: spark-core (was: )
[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-51272: Attachment: bugrepro.patch
[jira] [Assigned] (SPARK-51275) Session propagation in python readwrite
[ https://issues.apache.org/jira/browse/SPARK-51275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-51275: Assignee: Ruifeng Zheng
Key: SPARK-51275 / URL: https://issues.apache.org/jira/browse/SPARK-51275 / Project: Spark / Issue Type: Sub-task / Components: Connect, ML, PySpark / Affects Versions: 4.1 / Reporter: Ruifeng Zheng / Assignee: Ruifeng Zheng / Priority: Major / Labels: pull-request-available
[jira] [Updated] (SPARK-51262) exceptAll not working with drop_duplicates using subset
[ https://issues.apache.org/jira/browse/SPARK-51262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolau Balbino updated SPARK-51262 (description wording edited). Current description:
When using drop_duplicates with a subset and then using the exceptAll method, calling an action (isEmpty, show, collect, count) raises a Py4J error. Searching the web, this issue is related to https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612, which is also marked as resolved. I tested locally with version 3.5.3 and also on AWS Glue 5.0, which uses 3.5.
Key: SPARK-51262 / URL: https://issues.apache.org/jira/browse/SPARK-51262 / Project: Spark / Issue Type: Bug / Components: PySpark / Affects Versions: 3.5.0, 3.5.3 / Reporter: Nicolau Balbino / Priority: Minor / Labels: SQL
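A hedged repro sketch of the report (the exact failure depends on the Spark version; see the linked SPARK-39612 for the earlier, supposedly resolved variant):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "v"])
right = spark.createDataFrame([(1, "a")], ["id", "v"])

deduped = left.drop_duplicates(subset=["id"])  # dedupe on a column subset
diff = deduped.exceptAll(right)
diff.show()  # actions (isEmpty/show/collect/count) reportedly raise a Py4J error
{code}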
[jira] [Updated] (SPARK-51273) Spark Connect Call Procedure runs the procedure twice
[ https://issues.apache.org/jira/browse/SPARK-51273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51273: Labels: pull-request-available (was: )
[jira] [Updated] (SPARK-50864) Optimize and reenable slow pytorch tests
[ https://issues.apache.org/jira/browse/SPARK-50864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-50864: Labels: pull-request-available (was: )
Key: SPARK-50864 / URL: https://issues.apache.org/jira/browse/SPARK-50864 / Project: Spark / Issue Type: Test / Components: ML, PySpark, Tests / Affects Versions: 4.0.0, 4.1 / Reporter: Ruifeng Zheng / Priority: Major / Labels: pull-request-available
[jira] [Updated] (SPARK-51278) Use appropriate structure of JSON format for PySparkLogger
[ https://issues.apache.org/jira/browse/SPARK-51278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51278: Labels: pull-request-available (was: )
Key: SPARK-51278 / URL: https://issues.apache.org/jira/browse/SPARK-51278 / Project: Spark / Issue Type: Bug / Components: PySpark / Affects Versions: 4.1.0 / Reporter: Haejoon Lee / Priority: Major / Labels: pull-request-available
[jira] [Resolved] (SPARK-51249) Fix version encoding bug in NoPrefixKeyStateEncoder
[ https://issues.apache.org/jira/browse/SPARK-51249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-51249: Resolution: Fixed. Issue resolved by pull request 49996: https://github.com/apache/spark/pull/49996
Key: SPARK-51249 / URL: https://issues.apache.org/jira/browse/SPARK-51249 / Project: Spark / Issue Type: Bug / Components: Structured Streaming / Affects Versions: 4.0.0, 4.1 / Reporter: Eric Marnadi / Assignee: Eric Marnadi / Priority: Blocker / Labels: pull-request-available / Fix For: 4.0.0
Currently, we add two version bytes in the NoPrefixKeyStateEncoder when column families are enabled. This is unintended; we only want to add one version byte.
[jira] [Assigned] (SPARK-51249) Fix version encoding bug in NoPrefixKeyStateEncoder
[ https://issues.apache.org/jira/browse/SPARK-51249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-51249: Assignee: Eric Marnadi
[jira] [Updated] (SPARK-50864) Optimize and reenable slow pytorch tests
[ https://issues.apache.org/jira/browse/SPARK-50864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-50864: Affects Version/s: 4.1
[jira] [Created] (SPARK-51278) Use appropriate structure of JSON format for PySparkLogger
Haejoon Lee created SPARK-51278: Summary: Use appropriate structure of JSON format for PySparkLogger / Key: SPARK-51278 / URL: https://issues.apache.org/jira/browse/SPARK-51278 / Project: Spark / Issue Type: Bug / Components: PySpark / Affects Versions: 4.1.0 / Reporter: Haejoon Lee
[jira] [Updated] (SPARK-51272) Race condition in DagScheduler can result in failure of retrying all partitions for non deterministic partitioning key
[ https://issues.apache.org/jira/browse/SPARK-51272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51272: Labels: pull-request-available spark-core (was: spark-core)
[jira] [Created] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default
Hyukjin Kwon created SPARK-51276: Summary: Enable spark.sql.execution.arrow.pyspark.enabled by default / Key: SPARK-51276 / URL: https://issues.apache.org/jira/browse/SPARK-51276 / Project: Spark / Issue Type: Improvement / Components: Connect, PySpark / Affects Versions: 4.0.0 / Reporter: Hyukjin Kwon
spark.sql.execution.arrow.pyspark.enabled has been there for a few years and is stable. We can enable it by default.
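For reference, this is the flag in question; today it has to be enabled explicitly, after which toPandas() and createDataFrame(pandas_df) transfer data as Arrow batches:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = spark.range(1000).toPandas()  # transferred via Arrow instead of pickled rows
{code}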
[jira] [Updated] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-51276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51276: Labels: pull-request-available (was: )
[jira] [Updated] (SPARK-51276) Enable spark.sql.execution.arrow.pyspark.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-51276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-51276: Labels: release-notes (was: pull-request-available)
[jira] [Updated] (SPARK-48516) Turn on Arrow optimization for Python UDFs by default
[ https://issues.apache.org/jira/browse/SPARK-48516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-48516: Labels: pull-request-available release-notes (was: pull-request-available)
Key: SPARK-48516 / URL: https://issues.apache.org/jira/browse/SPARK-48516 / Project: Spark / Issue Type: Sub-task / Components: PySpark / Affects Versions: 4.0.0 / Reporter: Xinrong Meng / Assignee: Xinrong Meng / Priority: Major / Fix For: 4.0.0
Turn on Arrow optimization for Python UDFs by default.
[jira] [Created] (SPARK-51277) Implement 0-arg implementation in Arrow-optimized Python UDF
Hyukjin Kwon created SPARK-51277: Summary: Implement 0-arg implementation in Arrow-optimized Python UDF / Key: SPARK-51277 / URL: https://issues.apache.org/jira/browse/SPARK-51277 / Project: Spark / Issue Type: Sub-task / Components: PySpark / Affects Versions: 4.0.0 / Reporter: Hyukjin Kwon
The reason 0-arg pandas UDFs are not implemented is that you don't know how many rows to return, but in Arrow-optimized Python UDFs we do know, so we can implement this.
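A sketch of what a 0-arg Arrow-optimized UDF would look like once this lands. useArrow=True is the existing opt-in for Arrow-optimized Python UDFs; the 0-arg support itself is what this ticket adds, so the example is forward-looking:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

@udf(returnType="string", useArrow=True)
def greeting():  # zero arguments: output length must match the input row count
    return "hello"

spark.range(3).select(greeting()).show()
{code}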
[jira] [Comment Edited] (SPARK-51016) result data compromised in case of indeterministic join keys in Outer Join op, when retry happens
[ https://issues.apache.org/jira/browse/SPARK-51016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924245#comment-17924245 ] Asif edited comment on SPARK-51016 at 2/20/25 11:20 PM:
Pull Request: https://github.com/apache/spark/pull/50029
was (Author: ashahid7): Pull Request: https://github.com/apache/spark/pull/49708
Key: SPARK-51016 / URL: https://issues.apache.org/jira/browse/SPARK-51016 / Project: Spark / Issue Type: Bug / Components: SQL / Affects Versions: 3.5.4 / Reporter: Asif / Priority: Major / Labels: pull-request-available, spark-sql
For a query like:
{quote}
val outerDf = spark.createDataset(
  Seq((1L, "aa"), (null, "aa"), (2L, "bb"), (null, "bb"), (3L, "cc"), (null, "cc")))(
  Encoders.tupleEncoder(Encoders.LONG, Encoders.STRING)).toDF("pkLeftt", "strleft")

val innerDf = spark.createDataset(
  Seq((1L, "11"), (2L, "22"), (3L, "33")))(
  Encoders.tupleEncoder(Encoders.LONG, Encoders.STRING)).toDF("pkRight", "strright")

val leftOuter = outerDf.select(
  col("strleft"),
  when(isnull(col("pkLeftt")), floor(rand() * Literal(1000L)).cast(LongType)).
    otherwise(col("pkLeftt")).as("pkLeft"))

val outerjoin = leftOuter.hint("shuffle_hash").
  join(innerDf, col("pkLeft") === col("pkRight"), "left_outer")
{quote}
where an arbitrary long value is assigned to the left table's joining key if it is null (so that skew is avoided during shuffle), wrong results such as data loss can occur if a partition task is retried. The reason is that in such cases, if even one partition task fails, the whole shuffle stage needs to be re-attempted: retrying only the single failed task can result in some rows getting assigned to a partition whose task has already finished. Spark's DAGScheduler and TaskSchedulerImpl already expect such situations and rely on the boolean stage.isIndeterminate to retry the whole stage. This boolean is evaluated by consulting the RDD's dependency graph to find whether any dependency is indeterministic. The bug is in the ShuffleDependency code, which does not consult the hashing expression to see whether any of its components represents an indeterministic value. Also, there is no way to identify whether an attribute reference or an expression's value has an indeterministic component.
[jira] [Created] (SPARK-51275) Session propagation in python readwrite
Ruifeng Zheng created SPARK-51275: Summary: Session propagation in python readwrite / Key: SPARK-51275 / URL: https://issues.apache.org/jira/browse/SPARK-51275 / Project: Spark / Issue Type: Sub-task / Components: Connect, ML, PySpark / Affects Versions: 4.1 / Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-51275) Session propagation in python readwrite
[ https://issues.apache.org/jira/browse/SPARK-51275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51275: Labels: pull-request-available (was: )
[jira] [Resolved] (SPARK-51267) Match local Spark Connect server logic between Python and Scala
[ https://issues.apache.org/jira/browse/SPARK-51267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-51267: Fix Version/s: 4.0.0, Resolution: Fixed. Issue resolved by pull request 50017: https://github.com/apache/spark/pull/50017
Key: SPARK-51267 / URL: https://issues.apache.org/jira/browse/SPARK-51267 / Project: Spark / Issue Type: Bug / Components: Connect / Affects Versions: 4.0.0 / Reporter: Hyukjin Kwon / Assignee: Hyukjin Kwon / Priority: Major / Labels: pull-request-available / Fix For: 4.0.0
There are some differences. We should match the logic.
[jira] [Assigned] (SPARK-51267) Match local Spark Connect server logic between Python and Scala
[ https://issues.apache.org/jira/browse/SPARK-51267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-51267: Assignee: Hyukjin Kwon
[jira] [Resolved] (SPARK-48530) [M1] Support for local variables
[ https://issues.apache.org/jira/browse/SPARK-48530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48530: Fix Version/s: 4.0.0, Resolution: Fixed. Issue resolved by pull request 49445: https://github.com/apache/spark/pull/49445
Key: SPARK-48530 / URL: https://issues.apache.org/jira/browse/SPARK-48530 / Project: Spark / Issue Type: Sub-task / Components: Spark Core / Affects Versions: 4.0.0 / Reporter: David Milicevic / Assignee: David Milicevic / Priority: Major / Labels: pull-request-available / Fix For: 4.0.0
At the moment, variables in SQL scripts create session variables. We don't want this; we want variables to be treated as local (within the block/compound).
-To achieve this, we probably need to wait for labels support. Once we have it, we can prepend variable names with labels to distinguish variables with the same name, and only then reuse the session-variables mechanism to save values under such composed names.- -If the block/compound doesn't have a label, we should generate one automatically (GUID or something similar).-
Labels support is done, so we can use labels now for local variables; they need to be able to be referenced by <label>.<variable name>. We cannot reuse session variables for this; we need to implement local variables per script, with scoping and shadowing. Session variables still need to be accessible at all times from the script via `system.session.<variable name>`.
-Also, variables cannot recover from script failures: if a script fails, variables won't be dropped. If we intend to reuse session variables for local variables, we should think about how to fix this.- We won't be using session variables, so this shouldn't be a problem, but we need to keep it in mind because a wrong implementation could cause the same problem.
Use the Session Variables PR (https://github.com/apache/spark/pull/40474/files) as a reference for the places where variable changes need to happen.
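A sketch of the local-variable scoping and shadowing this ticket implements, expressed as a SQL script. Assumptions: the spark.sql.scripting.enabled flag and the exact DECLARE/BEGIN...END syntax follow the in-progress SQL scripting work, and the script's result is that of its last statement:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.scripting.enabled", "true")  # scripting feature flag

spark.sql("""
BEGIN
  DECLARE x INT DEFAULT 1;    -- local to this compound, not a session variable
  BEGIN
    DECLARE x INT DEFAULT 2;  -- shadows the outer x inside this inner block
    SELECT x;                 -- 2
  END;
  SELECT x;                   -- 1 again once the inner block exits
END
""").show()
{code}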
[jira] [Assigned] (SPARK-48530) [M1] Support for local variables
[ https://issues.apache.org/jira/browse/SPARK-48530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48530: Assignee: David Milicevic
[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for avro compression level
[ https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51269: Labels: pull-request-available (was: )
Key: SPARK-51269 / URL: https://issues.apache.org/jira/browse/SPARK-51269 / Project: Spark / Issue Type: Improvement / Components: SQL / Affects Versions: 4.0.0 / Reporter: Jiaan Geng / Assignee: Jiaan Geng / Priority: Major / Labels: pull-request-available
Currently, the default value of spark.sql.avro.deflate.level is -1, but it is managed via the enum AvroCompressionCodec. If developers use the config item on any other code path, there is no guarantee of the default value. On the other hand, users will be confused if there is no default value in SQLConf.
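For context, a sketch of setting the config explicitly when writing Avro with the deflate codec (-1 means "use the codec's default level"; writing Avro assumes the spark-avro module is on the classpath):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")  # 1-9, or -1 for the default

spark.range(10).write.format("avro").mode("overwrite").save("/tmp/avro_out")
{code}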
[jira] [Updated] (SPARK-51269) Let the SQLConf control the default value for compression level
[ https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-51269: Description set (text as quoted in the label update above).
[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for compression level
[ https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-51269: Summary: SQLConf should manage the default value for compression level (was: Let the SQLConf control the default value for compression level)
[jira] [Updated] (SPARK-51263) Clean up unnecessary `invokePrivate` method calls in the test code.
[ https://issues.apache.org/jira/browse/SPARK-51263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51263: Labels: pull-request-available (was: )
Key: SPARK-51263 / URL: https://issues.apache.org/jira/browse/SPARK-51263 / Project: Spark / Issue Type: Improvement / Components: Spark Core, SQL, Tests / Affects Versions: 4.1.0 / Reporter: Yang Jie / Priority: Minor / Labels: pull-request-available
[jira] [Updated] (SPARK-49489) 'TTransportException: MaxMessageSize reached' occurs if get a table with a large number of partitions
[ https://issues.apache.org/jira/browse/SPARK-49489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-49489: Labels: pull-request-available (was: )
Key: SPARK-49489 / URL: https://issues.apache.org/jira/browse/SPARK-49489 / Project: Spark / Issue Type: Bug / Components: SQL / Affects Versions: 4.0.0 / Reporter: Yuming Wang / Priority: Major / Labels: pull-request-available / Attachments: Internal fix.png
{noformat}
org.apache.thrift.transport.TTransportException: MaxMessageSize reached
  at org.apache.thrift.transport.TEndpointTransport.countConsumedMessageBytes(TEndpointTransport.java:96)
  at org.apache.thrift.transport.TMemoryInputTransport.read(TMemoryInputTransport.java:97)
  at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:390)
  at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:39)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109)
  at org.apache.hadoop.hive.metastore.security.TFilterTransport.readAll(TFilterTransport.java:63)
  at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:464)
  at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:329)
  at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:273)
  at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:461)
  at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:454)
  at org.apache.hadoop.hive.metastore.api.FieldSchema.read(FieldSchema.java:388)
  at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1269)
  at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1248)
  at org.apache.hadoop.hive.metastore.api.StorageDescriptor.read(StorageDescriptor.java:1110)
  at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:1270)
  at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:1205)
  at org.apache.hadoop.hive.metastore.api.Partition.read(Partition.java:1062)
  ...
{noformat}
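A common workaround until a fix lands is to raise the Thrift client's message-size cap when the metastore response for a heavily partitioned table exceeds the default. This is a sketch under the assumption that the hive.thrift.client.max.message.size property is supported by the bundled Hive metastore client version:
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Forwarded to the Hive metastore client; property availability assumed.
    .config("spark.hadoop.hive.thrift.client.max.message.size", "1gb")
    .enableHiveSupport()
    .getOrCreate()
)
{code}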
[jira] [Created] (SPARK-51270) Support new Variant types UUID, Time, and nanosecond timestamp
David Cashman created SPARK-51270: Summary: Support new Variant types UUID, Time, and nanosecond timestamp / Key: SPARK-51270 / URL: https://issues.apache.org/jira/browse/SPARK-51270 / Project: Spark / Issue Type: Sub-task / Components: SQL / Affects Versions: 4.1 / Reporter: David Cashman
Four new types were added to the Parquet Variant spec (https://github.com/apache/parquet-format/commit/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2): UUID, Time, Timestamp(NANOS), and TimestampNTZ(NANOS). Spark does not have corresponding types, but we should add support for basic Variant operations: extraction, cast to JSON/string, and reporting the type in SchemaOfVariant.
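The basic Variant operations the new types need to flow through already exist in Spark 4.0 (parse_json, variant_get, schema_of_variant); a short example of each:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.sql("""SELECT parse_json('{"ts": "2024-01-01"}') AS v""")
df.select(
    F.variant_get("v", "$.ts", "string"),  # extraction
    F.col("v").cast("string"),             # cast to a JSON string
    F.schema_of_variant(F.col("v")),       # type reporting
).show(truncate=False)
{code}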
[jira] [Updated] (SPARK-51270) Support new Variant types UUID, Time, and nanosecond timestamp
[ https://issues.apache.org/jira/browse/SPARK-51270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51270: Labels: pull-request-available (was: )
[jira] [Assigned] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`
[ https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-51266: Assignee: Yang Jie
Key: SPARK-51266 / URL: https://issues.apache.org/jira/browse/SPARK-51266 / Project: Spark / Issue Type: Improvement / Components: Spark Core / Affects Versions: 4.1.0 / Reporter: Yang Jie / Assignee: Yang Jie / Priority: Minor / Labels: pull-request-available
[jira] [Resolved] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`
[ https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-51266. --- Fix Version/s: 4.1.0 Resolution: Fixed Issue resolved by pull request 50016 [https://github.com/apache/spark/pull/50016] > Remove the no longer used definition of `private[spark] object > TaskDetailsClassNames` > - > > Key: SPARK-51266 > URL: https://issues.apache.org/jira/browse/SPARK-51266 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51266) Remove the no longer used definition of `private[spark] object TaskDetailsClassNames`
[ https://issues.apache.org/jira/browse/SPARK-51266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-51266: -- Parent: SPARK-51166 Issue Type: Sub-task (was: Improvement) > Remove the no longer used definition of `private[spark] object > TaskDetailsClassNames` > - > > Key: SPARK-51266 > URL: https://issues.apache.org/jira/browse/SPARK-51266 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-50785) [M1] FOR statement improvements
[ https://issues.apache.org/jira/browse/SPARK-50785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-50785: --- Labels: pull-request-available (was: ) > [M1] FOR statement improvements > --- > > Key: SPARK-50785 > URL: https://issues.apache.org/jira/browse/SPARK-50785 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dusan Tisma >Priority: Major > Labels: pull-request-available > > FOR statement should be updated to use the proper implementation of local > variables, once they are checked in. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
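For context on what this touches, here is a minimal sketch of the statement in question, assuming the SQL scripting syntax shipped with Spark 4.0 (compound BEGIN ... END blocks, DECLAREd local variables, and FOR ... AS ... DO ... END FOR); the point of the ticket is that a variable like `total` below should resolve through the proper local-variable machinery once it lands.
{code:scala}
// Sum ids with a scripting FOR loop; the script's last statement is returned.
spark.sql("""
  BEGIN
    DECLARE total INT DEFAULT 0;
    FOR r AS SELECT id FROM range(5) DO
      SET total = total + r.id;
    END FOR;
    SELECT total;
  END
""").show()
{code}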
[jira] [Created] (SPARK-51271) Python Data Sources Filter Pushdown API
Haoyu Weng created SPARK-51271: -- Summary: Python Data Sources Filter Pushdown API Key: SPARK-51271 URL: https://issues.apache.org/jira/browse/SPARK-51271 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.1.0 Reporter: Haoyu Weng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51097) Adding state store level metrics for last uploaded snapshot version in RocksDB
[ https://issues.apache.org/jira/browse/SPARK-51097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-51097: Assignee: Zeyu Chen > Adding state store level metrics for last uploaded snapshot version in RocksDB > -- > > Key: SPARK-51097 > URL: https://issues.apache.org/jira/browse/SPARK-51097 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0, 4.1.0 >Reporter: Zeyu Chen >Assignee: Zeyu Chen >Priority: Minor > Labels: pull-request-available > > We currently lack detailed visibility into state store level state > maintenance in RocksDB. This limitation affects the ability to identify > performance degradation issues behind maintenance tasks. > To remediate this, we will introduce state store "instance" metrics to > StreamingQueryProgress to track the latest snapshot version uploaded in > RocksDB. > This improvement addresses three challenges in observability: > * Uneven partition starvation, where we need to identify partitions with > slow state maintenance, > * Finding missing snapshots across versions, so we minimize extensive > replays during recovery, > * Identifying performance instability, such as gaining insights into snapshot > upload patterns -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51097) Adding state store level metrics for last uploaded snapshot version in RocksDB
[ https://issues.apache.org/jira/browse/SPARK-51097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-51097. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 49816 [https://github.com/apache/spark/pull/49816] > Adding state store level metrics for last uploaded snapshot version in RocksDB > -- > > Key: SPARK-51097 > URL: https://issues.apache.org/jira/browse/SPARK-51097 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0, 4.1.0 >Reporter: Zeyu Chen >Assignee: Zeyu Chen >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > We currently lack detailed visibility into state store level state > maintenance in RocksDB. This limitation affects the ability to identify > performance degradation issues behind maintenance tasks. > To remediate this, we will introduce state store "instance" metrics to > StreamingQueryProgress to track the latest snapshot version uploaded in > RocksDB. > This improvement addresses three challenges in observability: > * Uneven partition starvation, where we need to identify partitions with > slow state maintenance, > * Finding missing snapshots across versions, so we minimize extensive > replays during recovery, > * Identifying performance instability, such as gaining insights into snapshot > upload patterns -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
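A sketch of how a user might consume these instance metrics once they surface in StreamingQueryProgress. The metric-name fragment below is an assumption for illustration; match it to whatever keys the merged PR publishes under the state operator's customMetrics.
{code:scala}
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Log per-partition "last uploaded snapshot version" entries from each
// progress event, so partitions with lagging maintenance stand out.
class SnapshotUploadListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    for {
      op <- event.progress.stateOperators
      (name, version) <- op.customMetrics.asScala
      if name.contains("SnapshotLastUploaded") // assumed metric-name fragment
    } println(s"${event.progress.id} $name -> version $version")
  }
}

// Registration: spark.streams.addListener(new SnapshotUploadListener())
{code}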
[jira] [Updated] (SPARK-51269) SQLConf should manage the default value for avro compression level
[ https://issues.apache.org/jira/browse/SPARK-51269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-51269: --- Summary: SQLConf should manage the default value for avro compression level (was: SQLConf should manage the default value for compression level) > SQLConf should manage the default value for avro compression level > -- > > Key: SPARK-51269 > URL: https://issues.apache.org/jira/browse/SPARK-51269 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > > Currently, the default value of spark.sql.avro.deflate.level is -1. But it is > managed with the enum AvroCompressionCodec. > If developers use the config item in any other code path, there is no > guarantee for the default value. > On the other hand, users will be confused if there is no default value in > SQLConf. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
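A sketch of the configuration path in question: spark.sql.avro.deflate.level is read when the Avro writer's codec is deflate, with -1 meaning "use the codec's own default", which is the value the ticket wants SQLConf, rather than AvroCompressionCodec, to own. The output path is hypothetical.
{code:scala}
// Write Avro with an explicit deflate level; -1 defers to the codec default.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "6")

spark.range(1000).toDF("id")
  .write
  .format("avro")
  .mode("overwrite")
  .save("/tmp/avro-deflate-demo") // hypothetical output path
{code}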
[jira] [Updated] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
[ https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Albrecht updated SPARK-51092: -- Summary: Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms (was: FlatMapGroupsWithState: Change the dataType of the timestampTimeoutAttribute from IntegerType to LongType) > Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms > - > > Key: SPARK-51092 > URL: https://issues.apache.org/jira/browse/SPARK-51092 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0 >Reporter: Jonathan Albrecht >Priority: Minor > Labels: big-endian, pull-request-available > > The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in > FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is > serialized and stored in memory as a Long everywhere. > This causes the {{timestampTimeout}} value to be corrupted on big endian > platforms. This is seen in unit tests relating to the v1 state manager when > run on big endian platforms. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51269) Let the SQLConf control the default value for compression level
Jiaan Geng created SPARK-51269: -- Summary: Let the SQLConf control the default value for compression level Key: SPARK-51269 URL: https://issues.apache.org/jira/browse/SPARK-51269 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
[ https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Albrecht updated SPARK-51092: -- Description: The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is serialized and stored in memory as a Long everywhere. This causes the {{timestampTimeout}} value to be corrupted on big endian platforms. This is seen in unit tests relating to the v1 state manager when run on big endian platforms. This can't be fixed because it would be a breaking schema change, so skip the tests instead. was: The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is serialized and stored in memory as a Long everywhere. This causes the {{timestampTimeout}} value to be corrupted on big endian platforms. This is seen in unit tests relating to the v1 state manager when run on big endian platforms. > Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms > - > > Key: SPARK-51092 > URL: https://issues.apache.org/jira/browse/SPARK-51092 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0 >Reporter: Jonathan Albrecht >Priority: Minor > Labels: big-endian, pull-request-available > > The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in > FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is > serialized and stored in memory as a Long everywhere. > This causes the {{timestampTimeout}} value to be corrupted on big endian > platforms. This is seen in unit tests relating to the v1 state manager when > run on big endian platforms. > This can't be fixed because it would be a breaking schema change, so skip the > tests instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51092) Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms
[ https://issues.apache.org/jira/browse/SPARK-51092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-51092. -- Fix Version/s: 4.0.0 Assignee: Jonathan Albrecht Resolution: Fixed Issue resolved via https://github.com/apache/spark/pull/49811 > Skip the v1 FlatMapGroupsWithState tests with timeout on big endian platforms > - > > Key: SPARK-51092 > URL: https://issues.apache.org/jira/browse/SPARK-51092 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.4, 3.5.5, 4.1.0 >Reporter: Jonathan Albrecht >Assignee: Jonathan Albrecht >Priority: Minor > Labels: big-endian, pull-request-available > Fix For: 4.0.0 > > > The {{dataType}} of the {{timestampTimeoutAttribute}} is IntegerType in > FlatMapGroupsWithStateExecHelper.scala for the v1 state manager but it is > serialized and stored in memory as a Long everywhere. > This causes the {{timestampTimeout}} value to be corrupted on big endian > platforms. This is seen in unit tests relating to the v1 state manager when > run on big endian platforms. > This can't be fixed because it would be a breaking schema change, so skip the > tests instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
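The resulting skip pattern is the standard ScalaTest one used across Spark's suites: `assume` cancels a test, rather than failing it, when its precondition does not hold. A sketch with a hypothetical test name:
{code:scala}
import java.nio.ByteOrder

// Cancel the v1 timeout tests on big endian hardware, where the IntegerType
// schema mis-reads the Long-encoded timestampTimeout value.
def isLittleEndian: Boolean = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN

test("flatMapGroupsWithState - state format v1 with timeout") { // hypothetical name
  assume(isLittleEndian, "v1 state schema corrupts timeout timestamps on big endian")
  // ... original test body ...
}
{code}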