[jira] [Assigned] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer

2025-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-51670:
---

Assignee: Mihailo Milosevic

> Refactor Intersect and Except to follow Union example to reuse in single-pass 
> Analyzer
> --
>
> Key: SPARK-51670
> URL: https://issues.apache.org/jira/browse/SPARK-51670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51675) Fix issue around creating col family after db open to avoid redundant snapshot creation

2025-03-31 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-51675:


Assignee: Anish Shrigondekar

> Fix issue around creating col family after db open to avoid redundant 
> snapshot creation
> ---
>
> Key: SPARK-51675
> URL: https://issues.apache.org/jira/browse/SPARK-51675
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Fix issue around creating col family after db open



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2025-03-31 Thread Bobby Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939922#comment-17939922
 ] 

Bobby Wang commented on SPARK-45598:


I believe this issue has been fixed by 
https://issues.apache.org/jira/browse/SPARK-51537

> Delta table 3.0.0 not working with Spark Connect 3.5.0
> --
>
> Key: SPARK-45598
> URL: https://issues.apache.org/jira/browse/SPARK-45598
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Spark version 3.5.0
> Spark Connect version 3.5.0
> Delta table 3.0.0
> Spark connect server was started using
> {{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
> org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0 
> --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" 
> --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" 
> --conf 
> 'spark.jars.repositories=https://oss.sonatype.org/content/repositories/iodelta-1120'}}
> The connect client depends on
> {{libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"}}
> and the connect libraries.
>  
> When trying to run a simple job that writes to a delta table
> {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{val data = spark.read.json("profiles.json")}}
> {{data.write.format("delta").save("/tmp/delta")}}
>  
> {{Error log in connect client}}
> {{Exception in thread "main" org.apache.spark.SparkException: 
> io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: 
> cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 
> in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
> {{    at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
> {{    at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{...}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
> {{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
> {{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
> {{    at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:

[jira] [Commented] (SPARK-46762) Spark Connect 3.5 Classloading issue with external jar

2025-03-31 Thread Bobby Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939920#comment-17939920
 ] 

Bobby Wang commented on SPARK-46762:


Hi [~tenstriker] [~snowch], I encountered the same issue as 
https://issues.apache.org/jira/browse/SPARK-51537, and I had a PR for it 
which was merged today. You can try the latest Spark branch-4.0 to see if the 
issue is still there.

> Spark Connect 3.5 Classloading issue with external jar
> --
>
> Key: SPARK-46762
> URL: https://issues.apache.org/jira/browse/SPARK-46762
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: nirav patel
>Priority: Major
> Attachments: Screenshot 2024-02-22 at 2.04.37 PM.png, Screenshot 
> 2024-02-22 at 2.04.49 PM.png
>
>
> We are seeing the following `java.lang.ClassCastException` error in Spark 
> executors when using spark-connect 3.5 with an external Spark SQL catalog jar, 
> iceberg-spark-runtime-3.5_2.12-1.4.3.jar.
> We also set "spark.executor.userClassPathFirst=true"; otherwise the child 
> class gets loaded by MutableClassLoader and the parent class gets loaded by 
> ChildFirstClassLoader, which causes a ClassCastException as well.
>  
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3) (spark35-m.c.mycomp-dev-test.internal executor 2): 
> java.lang.ClassCastException: class 
> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
> class org.apache.iceberg.Table 
> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed 
> module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
> org.apache.iceberg.Table is in unnamed module of loader 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
>     at 
> org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>     at 
> org.apache.iceberg.spark.source.RowDataReader.(RowDataReader.java:50)
>     at 
> org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>     at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apach...{code}
>  
> `org.apache.iceberg.spark.source.SerializableTableWithSize` is a child of 
> `org.apache.iceberg.Table`, and both live in a single jar, 
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar`.
> We verified that only one copy of 
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` is loaded when the spark-connect 
> server is started.
> Looking further into the error, it seems the classloader itself is 
> instantiated multiple times somewhere. I can see two instances: 
> org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943.
>  
> *Affected version:*
> Spark 3.5 with spark-connect_2.12:3.5.0
>  
> *Not affected versions and variations:*
> Spark 3.4 with spark-connect_2.12:3.4.0 works fine with the external jar.
> It also works with plain Spark 3.5 via the spark-submit script directly (i.e. 
> without using spark-connect 3.5).
>  
> Issue has been open with Iceberg as well: 
> [https://github.com/apache/iceberg/issues/8978]
> And been discussed in dev@org.apache.iceberg: 
> [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]
>  
>  
> Steps to reproduce:
>  
> 1) To see that Spark loads the same class twice using different 
> classloaders:
>  
> Start the spark-connect server with the required jars

[jira] [Created] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website

2025-03-31 Thread Pengfei Xu (Jira)
Pengfei Xu created SPARK-51674:
--

 Summary: Remove unnecessary Spark Connect doc link from Spark 
website
 Key: SPARK-51674
 URL: https://issues.apache.org/jira/browse/SPARK-51674
 Project: Spark
  Issue Type: Bug
  Components: Connect, Documentation
Affects Versions: 4.1.0
Reporter: Pengfei Xu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51666) Fix sparkStageCompleted executorRunTime metric calculation

2025-03-31 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-51666:
--

Assignee: Weichen Xu

>  Fix sparkStageCompleted executorRunTime metric calculation
> ---
>
> Key: SPARK-51666
> URL: https://issues.apache.org/jira/browse/SPARK-51666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>
> Fix the sparkStageCompleted executorRunTime metric calculation:
> When a Spark task uses multiple CPUs, the CPU-seconds should capture the 
> total execution time across all CPUs. For example, if a stage sets 
> cpus-per-task to 48 and a task runs for 10 seconds on each CPU, the total 
> CPU-seconds for that stage should be 10 seconds x 1 task x 48 CPUs = 480 
> CPU-seconds. If another task uses only 1 CPU, its total is 10 seconds x 1 
> CPU = 10 CPU-seconds.
> *This is an important fix: since Spark introduced stage-level scheduling (so 
> tasks of different stages can be configured with different numbers of CPUs), 
> the metric calculation is wrong in the stage-level scheduling case without 
> it.*
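
A worked version of the arithmetic above (a minimal sketch; the function name 
and inputs are hypothetical, not Spark's API):
{code:python}
# Per-stage CPU-seconds should scale each task's run time by the number of
# CPUs the task was scheduled with (stage-level scheduling can change this).
def stage_cpu_seconds(task_run_times_sec, cpus_per_task):
    return sum(t * cpus_per_task for t in task_run_times_sec)

# Stage configured with 48 CPUs per task: one 10s task -> 480 CPU-seconds.
assert stage_cpu_seconds([10.0], cpus_per_task=48) == 480.0
# Stage with the default 1 CPU per task: one 10s task -> 10 CPU-seconds.
assert stage_cpu_seconds([10.0], cpus_per_task=1) == 10.0
{code}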



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect

2025-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-51537:


Assignee: Bobby Wang

> Failed to run third-party Spark ML library on Spark Connect 
> 
>
> Key: SPARK-51537
> URL: https://issues.apache.org/jira/browse/SPARK-51537
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 4.0.0
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
>
> I've encountered an issue where a third-party Spark ML library may not run 
> on Spark Connect. The problem occurs when the third-party ML jar is 
> specified via the *--jars* option while creating a connect server based on a 
> Spark standalone cluster.
>  
> The exception thrown is a ClassCastException:
>  
> _Caused by: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD_
>         _at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_
>         
> However, if I place the ML jar into the *$SPARK_HOME/jars* directory and 
> restart both the Spark standalone cluster and the Spark Connect server, it 
> runs without any exceptions.
>  
> Alternatively, adding 
> *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in 
> the python code also resolves the issue.
>  
> I have made a minimal project that reproduces this issue; more details can 
> be found at [https://github.com/wbo4958/ConnectMLIssue] 
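
A minimal sketch of the second workaround described above, for a PySpark 
Connect client (the remote URL and jar path are placeholders for your own 
deployment):
{code:python}
from pyspark.sql import SparkSession

# Connect to the Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Ship the third-party ML jar as a session artifact instead of relying on
# --jars at connect-server start-up.
spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")
{code}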



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect

2025-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-51537.
--
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50334
[https://github.com/apache/spark/pull/50334]

> Failed to run third-party Spark ML library on Spark Connect 
> 
>
> Key: SPARK-51537
> URL: https://issues.apache.org/jira/browse/SPARK-51537
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 4.0.0
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> I've encountered an issue where a third-party Spark ML library may not run 
> on Spark Connect. The problem occurs when the third-party ML jar is 
> specified via the *--jars* option while creating a connect server based on a 
> Spark standalone cluster.
>  
> The exception thrown is a ClassCastException:
>  
> _Caused by: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD_
>         _at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_
>         
> However, if I place the ML jar into the *$SPARK_HOME/jars* directory and 
> restart both the Spark standalone cluster and the Spark Connect server, it 
> runs without any exceptions.
>  
> Alternatively, adding 
> *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in 
> the python code also resolves the issue.
>  
> I have made a minimal project that reproduces this issue; more details can 
> be found at [https://github.com/wbo4958/ConnectMLIssue] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect

2025-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-51537:
-
Fix Version/s: 4.0.0
   (was: 4.1.0)

> Failed to run third-party Spark ML library on Spark Connect 
> 
>
> Key: SPARK-51537
> URL: https://issues.apache.org/jira/browse/SPARK-51537
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, ML
>Affects Versions: 4.0.0
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I've encountered an issue where a third-party Spark ML library may not run 
> on Spark Connect. The problem occurs when the third-party ML jar is 
> specified via the *--jars* option while creating a connect server based on a 
> Spark standalone cluster.
>  
> The exception thrown is a ClassCastException:
>  
> _Caused by: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD_
>         _at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_
>         
> However, if I place the ML jar into the *$SPARK_HOME/jars* directory and 
> restart both the Spark standalone cluster and the Spark Connect server, it 
> runs without any exceptions.
>  
> Alternatively, adding 
> *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in 
> the python code also resolves the issue.
>  
> I have made a minimal project that reproduces this issue; more details can 
> be found at [https://github.com/wbo4958/ConnectMLIssue] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51667) [TWS + Python] Disable Nagle's algorithm between Python worker and State Server

2025-03-31 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-51667:


Assignee: Jungtaek Lim

> [TWS + Python] Disable Nagle's algorithm between Python worker and State 
> Server
> ---
>
> Key: SPARK-51667
> URL: https://issues.apache.org/jira/browse/SPARK-51667
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> While testing TWS + Python, we found cases where the socket communication 
> for state interaction was delayed by more than 40ms for certain types of 
> state operations, e.g. ListState.put(), ListState.get(), 
> ListState.appendList(), etc.
> The root cause turned out to be the combination of Nagle's algorithm and 
> delayed ACK. The sequence is as follows:
>  # The Python worker sends the proto message to the JVM and flushes the 
> socket.
>  # Additionally, the Python worker sends the follow-up data to the JVM and 
> flushes the socket.
>  # The JVM reads the proto message and realizes there is follow-up data.
>  # The JVM reads the follow-up data.
>  # The JVM processes the request and sends the response back to the Python 
> worker.
> Due to delayed ACK, even after step 3, no ACK is sent back from the JVM to 
> the Python worker: the JVM is waiting for outgoing data (or multiple ACKs) 
> to piggyback on, but it has no data to send during that phase.
> Due to Nagle's algorithm, the message from step 2 is not sent to the JVM, 
> since there is no ACK yet for the message from step 1.
> This deadlock is resolved only after the delayed-ACK timeout, which is 40ms 
> (the minimum duration) on Linux. After the timeout, the ACK is sent back 
> from the JVM to the Python worker, so Nagle's algorithm finally allows the 
> message from step 2 to be sent to the JVM.
> See the articles below for a more general explanation:
>  * [https://engineering.avast.io/40-millisecond-bug/]
>  ** Start reading from the Nagle's algorithm section
>  * [https://brooker.co.za/blog/2024/05/09/nagle.html]
> Nagle's algorithm helps reduce the number of small packets, which the above 
> article notes can keep routers from being overloaded; but we connect to 
> "localhost" here, so that benefit does not apply.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51676) Support `printSchema` for `DataFrame`

2025-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-51676.
---
Fix Version/s: connect-swift-0.1.0
   Resolution: Fixed

Issue resolved by pull request 35
[https://github.com/apache/spark-connect-swift/pull/35]

> Support `printSchema` for `DataFrame`
> -
>
> Key: SPARK-51676
> URL: https://issues.apache.org/jira/browse/SPARK-51676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: connect-swift-0.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: connect-swift-0.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51676) Support `printSchema` for `DataFrame`

2025-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-51676:
-

Assignee: Dongjoon Hyun

> Support `printSchema` for `DataFrame`
> -
>
> Key: SPARK-51676
> URL: https://issues.apache.org/jira/browse/SPARK-51676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: connect-swift-0.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51637) Python Data Sources columnar read

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51637:
---
Labels: pull-request-available  (was: )

> Python Data Sources columnar read
> -
>
> Key: SPARK-51637
> URL: https://issues.apache.org/jira/browse/SPARK-51637
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.1.0
>Reporter: Haoyu Weng
>Priority: Major
>  Labels: pull-request-available
>
> Use PartitionReader[ColumnarBatch] in Scala for improved performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51676) Support `printSchema` for `DataFrame`

2025-03-31 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-51676:
-

 Summary: Support `printSchema` for `DataFrame`
 Key: SPARK-51676
 URL: https://issues.apache.org/jira/browse/SPARK-51676
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: connect-swift-0.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51676) Support `printSchema` for `DataFrame`

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51676:
---
Labels: pull-request-available  (was: )

> Support `printSchema` for `DataFrame`
> -
>
> Key: SPARK-51676
> URL: https://issues.apache.org/jira/browse/SPARK-51676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: connect-swift-0.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website

2025-03-31 Thread Pengfei Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengfei Xu updated SPARK-51674:
---
Description: 
We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be 
visible in the main Spark API doc. Therefore a separate doc for Spark Connect 
is not necessary.
We should revert SPARK-51288.

> Remove unnecessary Spark Connect doc link from Spark website
> 
>
> Key: SPARK-51674
> URL: https://issues.apache.org/jira/browse/SPARK-51674
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation
>Affects Versions: 4.1.0
>Reporter: Pengfei Xu
>Priority: Major
>
> We now have all Classic-only APIs annotated with @ClassicOnly. This tag will 
> be visible in the main Spark API doc. Therefore a separate doc for Spark 
> Connect is not necessary.
> We should revert SPARK-51288.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website

2025-03-31 Thread Pengfei Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengfei Xu updated SPARK-51674:
---
Description: 
We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be 
visible in the main Spark API doc. Therefore a separate doc for Spark Connect 
is not necessary.
We should revert SPARK-51288.

  was:
We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be 
visible in the main Spark API doc. Therefore a separate doc for Spark Connect 
is not necessary.
e should revert SPARK-51288.


> Remove unnecessary Spark Connect doc link from Spark website
> 
>
> Key: SPARK-51674
> URL: https://issues.apache.org/jira/browse/SPARK-51674
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation
>Affects Versions: 4.1.0
>Reporter: Pengfei Xu
>Priority: Major
>
> We now have all Classic-only APIs annotated with @ClassicOnly. This tag will 
> be visible in the main Spark API doc. Therefore a separate doc for Spark 
> Connect is not necessary.
> We should revert SPARK-51288.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-50131) Update isin to accept DataFrame to work as IN subquery

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-50131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50131:
---
Labels: pull-request-available  (was: )

> Update isin to accept DataFrame to work as IN subquery
> --
>
> Key: SPARK-50131
> URL: https://issues.apache.org/jira/browse/SPARK-50131
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51657) UTF8_BINARY default table collation shown by default in Desc As JSON (v1)

2025-03-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-51657:
--

Assignee: Amanda Liu

> UTF8_BINARY default table collation shown by default in Desc As JSON (v1)
> -
>
> Key: SPARK-51657
> URL: https://issues.apache.org/jira/browse/SPARK-51657
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51657) UTF8_BINARY default table collation shown by default in Desc As JSON (v1)

2025-03-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-51657.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 50451
[https://github.com/apache/spark/pull/50451]

> UTF8_BINARY default table collation shown by default in Desc As JSON (v1)
> -
>
> Key: SPARK-51657
> URL: https://issues.apache.org/jira/browse/SPARK-51657
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer

2025-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-51670.
-
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50465
[https://github.com/apache/spark/pull/50465]

> Refactor Intersect and Except to follow Union example to reuse in single-pass 
> Analyzer
> --
>
> Key: SPARK-51670
> URL: https://issues.apache.org/jira/browse/SPARK-51670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-51650) Support delete ml cached objects in batch

2025-03-31 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-51650:
-

Assignee: Ruifeng Zheng

> Support delete ml cached objects in batch
> -
>
> Key: SPARK-51650
> URL: https://issues.apache.org/jira/browse/SPARK-51650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51674:
---
Labels: pull-request-available  (was: )

> Remove unnecessary Spark Connect doc link from Spark website
> 
>
> Key: SPARK-51674
> URL: https://issues.apache.org/jira/browse/SPARK-51674
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation
>Affects Versions: 4.1.0
>Reporter: Pengfei Xu
>Priority: Major
>  Labels: pull-request-available
>
> We now have all Classic-only APIs annotated with @ClassicOnly. This tag will 
> be visible in the main Spark API doc. Therefore a separate doc for Spark 
> Connect is not necessary.
> We should revert SPARK-51288.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51675) Fix issue around creating col family after db open

2025-03-31 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-51675:
--

 Summary: Fix issue around creating col family after db open
 Key: SPARK-51675
 URL: https://issues.apache.org/jira/browse/SPARK-51675
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar


Fix issue around creating col family after db open



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51668) Collect metrics for data source V2 in case the command fails

2025-03-31 Thread Jan-Ole Sasse (Jira)
Jan-Ole Sasse created SPARK-51668:
-

 Summary: Collect metrics for data source V2 in case the command 
fails
 Key: SPARK-51668
 URL: https://issues.apache.org/jira/browse/SPARK-51668
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jan-Ole Sasse






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33598) Support Java Class with circular references

2025-03-31 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939572#comment-17939572
 ] 

Tim Robertson commented on SPARK-33598:
---

Chiming in to say I've just encountered this while porting some code from 
Avro to POJOs in a Spark 3.4.x environment.

Bean A has references to Bean B, which in turn has a field of type Bean A.

 

 

> Support Java Class with circular references
> ---
>
> Key: SPARK-33598
> URL: https://issues.apache.org/jira/browse/SPARK-33598
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 3.1.2
>Reporter: jacklzg
>Priority: Minor
>
> If the target Java data class has a circular reference, Spark fails fast 
> when creating the Dataset or running Encoders.
>  
> For example, with a protobuf class there is a reference to Descriptor, so 
> there is no way to build a dataset from the protobuf class.
> From this line
> {{Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);}}
>  
> it will throw immediately:
>  
> {quote}Exception in thread "main" java.lang.UnsupportedOperationException: 
> Cannot have circular references in bean class, but got the circular reference 
> of class class com.google.protobuf.Descriptors$Descriptor
> {quote}
>  
> Can we add a parameter, for example,
>  
> {code:java}
> Encoders.bean(Class<T> clas, List<String> fieldsToIgnore);{code}
> 
> or
>  
> {code:java}
> Encoders.bean(Class<T> clas, boolean skipCircularRefField);{code}
>  
> which, instead of throwing an exception at 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L556],
>  would skip the field?
>  
> {code:java}
> if (seenTypeSet.contains(t)) {
>   if (skipCircularRefField) {
>     // just skip this field instead of failing
>     println("field skipped")
>   } else {
>     throw new UnsupportedOperationException(s"cannot have circular " +
>       s"references in class, but got the circular reference of class $t")
>   }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51672) Regenerate golden files with collation aliases in branch-4.0

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51672:
---
Labels: pull-request-available  (was: )

> Regenerate golden files with collation aliases in branch-4.0
> 
>
> Key: SPARK-51672
> URL: https://issues.apache.org/jira/browse/SPARK-51672
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> Regenerate golden files with collation aliases in branch-4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51669) Generate random TIME values in tests

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51669:
---
Labels: pull-request-available  (was: )

> Generate random TIME values in tests
> 
>
> Key: SPARK-51669
> URL: https://issues.apache.org/jira/browse/SPARK-51669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
>
> Extend RandomDataGenerator to support the new data type TIME.
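
A minimal sketch of the idea (not Spark's RandomDataGenerator API): a TIME 
value is a time of day, which can be drawn as a random count of microseconds 
since midnight:
{code:python}
import datetime
import random

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def random_time() -> datetime.time:
    # Draw a uniform offset within one day and convert it to a time of day.
    micros = random.randrange(MICROS_PER_DAY)
    return (datetime.datetime.min + datetime.timedelta(microseconds=micros)).time()

print(random_time())  # e.g. 13:37:02.123456
{code}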



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51669) Generate random TIME values in tests

2025-03-31 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-51669.
--
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50462
[https://github.com/apache/spark/pull/50462]

> Generate random TIME values in tests
> 
>
> Key: SPARK-51669
> URL: https://issues.apache.org/jira/browse/SPARK-51669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Extend RandomDataGenerator to support the new data type TIME.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51664) Support the TIME data type in the Hash expression

2025-03-31 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-51664.
--
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50456
[https://github.com/apache/spark/pull/50456]

> Support the TIME data type in the Hash expression
> -
>
> Key: SPARK-51664
> URL: https://issues.apache.org/jira/browse/SPARK-51664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Modify the HashExpression expression to support the TIME data type. In 
> particular, the following classes are affected:
> - Murmur3Hash: generating partition IDs
> - HiveHash: bucketing
> - XxHash64: Bloom filters
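
A minimal sketch of the idea, assuming (as with Spark's other time types) 
that a TIME value is physically an integer count of units since midnight; the 
names here are illustrative, not Spark's internals:
{code:python}
import datetime

def time_to_long(t: datetime.time) -> int:
    # Reduce a time-of-day to microseconds since midnight; hashing a TIME
    # value can then reuse the existing long hashing paths
    # (Murmur3Hash, HiveHash, XxHash64).
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

print(time_to_long(datetime.time(13, 37, 2, 123456)))
{code}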



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51669) Generate random TIME values in tests

2025-03-31 Thread Max Gekk (Jira)
Max Gekk created SPARK-51669:


 Summary: Generate random TIME values in tests
 Key: SPARK-51669
 URL: https://issues.apache.org/jira/browse/SPARK-51669
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.1.0
Reporter: Max Gekk
Assignee: Max Gekk


Extend RandomDataGenerator to support the new data type TIME.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-51663) Short circuit && operation for JoinSelectionHelper

2025-03-31 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-51663.

Fix Version/s: 4.1.0
   Resolution: Fixed

> Short circuit && operation for JoinSelectionHelper
> --
>
> Key: SPARK-51663
> URL: https://issues.apache.org/jira/browse/SPARK-51663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36600) Optimise speed and memory with Pyspark when create DataFrame (with patch)

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-36600:
---
Labels: easyfix pull-request-available  (was: easyfix)

> Optimise speed and memory with Pyspark when create DataFrame (with patch)
> -
>
> Key: SPARK-36600
> URL: https://issues.apache.org/jira/browse/SPARK-36600
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Philippe Prados
>Priority: Trivial
>  Labels: easyfix, pull-request-available
> Attachments: optimize_memory_pyspark.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Python method {{SparkSession._createFromLocal()}} materializes the data, 
> creating a list if it is not already an instance of list. But this is 
> necessary only if the schema is not present.
> {quote}# make sure data could consumed multiple times
>  if not isinstance(data, list):
>   data = list(data)
> {quote}
> If you use {{createDataFrame(data=_a_generator_,...)}}, all the data is 
> saved in memory as a list, then converted to rows in memory, then converted 
> to a buffer in pickle format, etc.
> Two lists are present in memory at the same time: the list created by 
> _createFromLocal() and the list created later with
> {quote}# convert python objects to sql data
> data = [schema.toInternal(row) for row in data]
> {quote}
> The purpose of using a generator is to reduce the memory footprint when the 
> data are dynamically built.
> {quote}def _createFromLocal(self, data, schema):
>   """
>   Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>   the RDD and schema.
>   """
>   if schema is None or isinstance(schema, (list, tuple)):
>     *# make sure data could consumed multiple times*
>     *if inspect.isgeneratorfunction(data):*
>       *data = list(data)*
>     struct = self._inferSchemaFromList(data, names=schema)
>     converter = _create_converter(struct)
>     data = map(converter, data)
>     if isinstance(schema, (list, tuple)):
>       for i, name in enumerate(schema):
>         struct.fields[i].name = name
>         struct.names[i] = name
>       schema = struct
>     elif not isinstance(schema, StructType):
>       raise TypeError("schema should be StructType or list or None, but got: 
> %s" % schema)
>   # convert python objects to sql data
>   data = [schema.toInternal(row) for row in data]
>   return self._sc.parallelize(data), schema{quote}
> Then, it becomes worthwhile to accept a generator.
>  
> The patch:
> {quote}diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
> index 57c680fd04..0dba590451 100644
> --- a/python/pyspark/sql/session.py
> +++ b/python/pyspark/sql/session.py
> @@ -15,6 +15,7 @@
>  # limitations under the License.
>  #
>  
> +import inspect
>  import sys
>  import warnings
>  from functools import reduce
> @@ -504,11 +505,11 @@ class SparkSession(SparkConversionMixin):
>          Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>          the RDD and schema.
>          """
> -        # make sure data could consumed multiple times
> -        if not isinstance(data, list):
> -            data = list(data)
>  
>          if schema is None or isinstance(schema, (list, tuple)):
> +            # make sure data could consumed multiple times
> +            if inspect.isgeneratorfunction(data):  # PPR
> +                data = list(data)
>              struct = self._inferSchemaFromList(data, names=schema)
>              converter = _create_converter(struct)
>              data = map(converter, data)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51671) Add column pruning to Recursive CTEs

2025-03-31 Thread Pavle Martinović (Jira)
Pavle Martinović created SPARK-51671:


 Summary: Add column pruning to Recursive CTEs
 Key: SPARK-51671
 URL: https://issues.apache.org/jira/browse/SPARK-51671
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.1.0
Reporter: Pavle Martinović






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-51673) Apply default collation for Alter view query

2025-03-31 Thread Marko Ilic (Jira)
Marko Ilic created SPARK-51673:
--

 Summary: Apply default collation for Alter view query
 Key: SPARK-51673
 URL: https://issues.apache.org/jira/browse/SPARK-51673
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Marko Ilic


Default collation is not applied when an ALTER VIEW is performed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51673) Apply default collation for Alter view query

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51673:
---
Labels: pull-request-available  (was: )

> Apply default collation for Alter view query
> 
>
> Key: SPARK-51673
> URL: https://issues.apache.org/jira/browse/SPARK-51673
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
>
> Default collation is not applied when an ALTER VIEW is performed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests

2025-03-31 Thread Jayadeep Jayaraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939692#comment-17939692
 ] 

Jayadeep Jayaraman commented on SPARK-30451:


Hi [~tgraves] - I would like to work on this request. As per the description, 
my understanding is that
{code:java}
class ExecutorResourceRequests() {code}
is a convenience wrapper around the class
{code:java}
class ExecutorResourceRequest(){code}
The class `ExecutorResourceRequests` has multiple methods to add different 
resources
{code:java}
def offHeapMemory(amount: String): this.type

def memoryOverhead(amount: String): this.type

def memory(amount: String): this.type{code}
and they are stored in a mutable map
{code:java}
new ConcurrentHashMap[String, ExecutorResourceRequest](){code}
The goal of this PR is to offer an API that allows users to build a request 
profile once, as shown below
{code:java}
val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder()
  .memory("1g")
  .cores(4)
  .resource("gpu", 2)
  .build()
{code}
and then offer ways to add or remove specific resources, as shown below
{code:java}
val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder()
  .offHeapMemory("512m")
  .add(new ExecutorResourceRequest("fpga", 1))
  .remove("memory") // No effect, since 'memory' was not added in this builder
  .build()
{code}
Kindly let me know if my understanding is correct; I would like to work on 
this task.

 

> Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have 
> functions to remove requests
> -
>
> Key: SPARK-30451
> URL: https://issues.apache.org/jira/browse/SPARK-30451
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have 
> functions to remove requests
> Currently in the design, ExecutorResourceRequests and TaskResourceRequests are 
> mutable and users can update them as they want. It would make sense to add 
> APIs to remove certain resource requirements from them. This would allow a 
> user to create one ExecutorResourceRequests object and then, if they want to 
> just add/remove something from it, they easily could without having to 
> recreate all the requests in it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests

2025-03-31 Thread Jayadeep Jayaraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939692#comment-17939692
 ] 

Jayadeep Jayaraman edited comment on SPARK-30451 at 3/31/25 1:23 PM:
-

Hi [~tgraves] - I would like to work on this request. As per the description, 
my understanding is that
{code:java}
class ExecutorResourceRequests() {code}
is a convenience wrapper around the class
{code:java}
class ExecutorResourceRequest(){code}
The class `ExecutorResourceRequests` has multiple methods to add different 
resources
{code:java}
def offHeapMemory(amount: String): this.type

def memoryOverhead(amount: String): this.type

def memory(amount: String): this.type{code}
and they are stored in a mutable map
{code:java}
new ConcurrentHashMap[String, ExecutorResourceRequest](){code}
The goal of this PR is to offer an API that allows users to build a request 
profile once, as shown below
{code:java}
val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder()
  .memory("1g")
  .cores(4)
  .resource("gpu", 2)
  .build()
{code}
and then offer ways to add or remove specific resources, as shown below
{code:java}
val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder()
  .offHeapMemory("512m")
  .add(new ExecutorResourceRequest("fpga", 1))
  .remove("memory") // No effect, since 'memory' was not added in this builder
  .build()
{code}
Kindly let me know if my understanding is correct; I would like to work on 
this task.

 


was (Author: jjayadeep):
Hi [~tgraves] - I would like to work on this request. As per the description my 
understanding is
{code:java}
class ExecutorResourceRequests() {code}
is a convenience method for the class 
{code:java}
class ExecutorResourceRequest(){code}
 Class `ExecutorResourceRequests() ` has multiple methods to add different 
resources
{code:java}
def offHeapMemory(amount: String): this.type

def memoryOverhead(amount: String): this.type

def memory(amount: String): this.type{code}
and they are stored in a mutable map
{code:java}
new ConcurrentHashMap[String, ExecutorResourceRequest](){code}
 

The goal of this PR is to offer an API that allows users to build a Request 
profile once as shown below
{code:java}
 
val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() 
.memory("1g") .cores(4) .resource("gpu", 2) .build()
{code}
 

 

and then offer ways to add or remove specific resources as shown below
{code:java}
 
val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() 
.offHeapMemory("512m") .add(new ExecutorResourceRequest("fpga", 1)) 
.remove("memory") // This would have no effect as 'memory' wasn't added in this 
builder .build()
{code}
 

 

Kindly, let me know if my understanding is correct and I would like to work on 
this task

 

> Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have 
> functions to remove requests
> -
>
> Key: SPARK-30451
> URL: https://issues.apache.org/jira/browse/SPARK-30451
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have 
> functions to remove requests
> Currently in the design, ExecutorResourceRequests and TaskResourceRequests are 
> mutable and users can update them as they want. It would make sense to add 
> APIs to remove certain resource requirements from them. This would allow a 
> user to create one ExecutorResourceRequests object and then, if they want to 
> just add/remove something from it, they easily could without having to 
> recreate all the requests in it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer

2025-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51670:
---
Labels: pull-request-available  (was: )

> Refactor Intersect and Except to follow Union example to reuse in single-pass 
> Analyzer
> --
>
> Key: SPARK-51670
> URL: https://issues.apache.org/jira/browse/SPARK-51670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.1.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org