[jira] [Assigned] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer
[ https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-51670: --- Assignee: Mihailo Milosevic > Refactor Intersect and Except to follow Union example to reuse in single-pass > Analyzer > -- > > Key: SPARK-51670 > URL: https://issues.apache.org/jira/browse/SPARK-51670 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.1.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51675) Fix issue around creating col family after db open to avoid redundant snapshot creation
[ https://issues.apache.org/jira/browse/SPARK-51675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-51675: Assignee: Anish Shrigondekar > Fix issue around creating col family after db open to avoid redundant > snapshot creation > --- > > Key: SPARK-51675 > URL: https://issues.apache.org/jira/browse/SPARK-51675 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > > Fix issue around creating col family after db open -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939922#comment-17939922 ] Bobby Wang commented on SPARK-45598: I believe this issue has been fixed by https://issues.apache.org/jira/browse/SPARK-51537 > Delta table 3.0.0 not working with Spark Connect 3.5.0 > -- > > Key: SPARK-45598 > URL: https://issues.apache.org/jira/browse/SPARK-45598 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Spark version 3.5.0 > Spark Connect version 3.5.0 > Delta table 3.0.0 > Spark connect server was started using > *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages > org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0}}* > --{*}{{conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" > --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf > 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}{*} > {{Connect client depends on}} > *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"* > *and the connect libraries* > > When trying to run a simple job that writes to a delta table > {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{val data = spark.read.json("profiles.json")}} > {{data.write.format("delta").save("/tmp/delta")}} > > {{Error log in connect client}} > {{Exception in thread "main" org.apache.spark.SparkException: > io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: > cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 > in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} > {{ at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} > {{ at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{...}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} > {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} > {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}} > {{ at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:
[jira] [Commented] (SPARK-46762) Spark Connect 3.5 Classloading issue with external jar
[ https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939920#comment-17939920 ] Bobby Wang commented on SPARK-46762: Hi [~tenstriker] [~snowch] , I encountered the same issue https://issues.apache.org/jira/browse/SPARK-51537. and I had the PR for it which is merged today. You can try the latest spark branch-4.0 to see if issue is still there. > Spark Connect 3.5 Classloading issue with external jar > -- > > Key: SPARK-46762 > URL: https://issues.apache.org/jira/browse/SPARK-46762 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: nirav patel >Priority: Major > Attachments: Screenshot 2024-02-22 at 2.04.37 PM.png, Screenshot > 2024-02-22 at 2.04.49 PM.png > > > We are having following `java.lang.ClassCastException` error in spark > Executors when using spark-connect 3.5 with external spark sql catalog jar - > iceberg-spark-runtime-3.5_2.12-1.4.3.jar > We also set "spark.executor.userClassPathFirst=true" otherwise child class > gets loaded by MutableClassLoader and parent class gets loaded by > ChildFirstCLassLoader and that causes ClassCastException as well. > > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3) (spark35-m.c.mycomp-dev-test.internal executor 2): > java.lang.ClassCastException: class > org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to > class org.apache.iceberg.Table > (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed > module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; > org.apache.iceberg.Table is in unnamed module of loader > org.apache.spark.util.ChildFirstURLClassLoader @4b18b943) > at > org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88) > at > org.apache.iceberg.spark.source.RowDataReader.(RowDataReader.java:50) > at > org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:141) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > at org.apach...{code} > > `org.apache.iceberg.spark.source.SerializableTableWithSize` is a child of > `org.apache.iceberg.Table` and they are both 
in only one jar > `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` > We verified that there's only one jar of > `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server > is started. > Looking more into Error it seems classloader itself is instantiated multiple > times somewhere. I can see two instances: > org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and > org.apache.spark.util.ChildFirstURLClassLoader @4b18b943 > > *Affected version:* > spark 3.5 and spark-connect_2.12:3.5.0 works fine > > *Not affected version and variation:* > Spark 3.4 and spark-connect_2.12:3.4.0 works fine with external jar > Also works with just Spark 3.5 spark-submit script directly (ie without using > spark-connect 3.5 ) > > Issue has been open with Iceberg as well: > [https://github.com/apache/iceberg/issues/8978] > And been discussed in dev@org.apache.iceberg: > [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1] > > > Steps to reproduce: > > 1) Just to see that spark is loading same class twice using different > classloader: > > Start spark-connect server with required jars
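To make the classloading failure above concrete — and separate from the reporter's reproduction steps, which are truncated here — below is a minimal, hypothetical Scala sketch showing why the same class loaded by two different URLClassLoader instances is not interchangeable; the jar path and class name are placeholders taken from the report.

{code:scala}
import java.net.{URL, URLClassLoader}

object DualLoaderSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder jar path; any jar containing the class below would do.
    val jar = new URL("file:/tmp/iceberg-spark-runtime-3.5_2.12-1.4.3.jar")

    // Two independent loaders over the same jar, mirroring the two
    // ChildFirstURLClassLoader instances (@5e7ae053 and @4b18b943) in the report.
    val loaderA = new URLClassLoader(Array(jar), null)
    val loaderB = new URLClassLoader(Array(jar), null)

    val tableA = loaderA.loadClass("org.apache.iceberg.Table")
    val tableB = loaderB.loadClass("org.apache.iceberg.Table")

    // Same fully qualified name, but two distinct runtime classes:
    println(tableA == tableB)  // false
    // Any instance of loaderA's class therefore fails a cast against
    // loaderB's class, which is the ClassCastException pattern seen above.
  }
}
{code}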
[jira] [Created] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website
Pengfei Xu created SPARK-51674: -- Summary: Remove unnecessary Spark Connect doc link from Spark website Key: SPARK-51674 URL: https://issues.apache.org/jira/browse/SPARK-51674 Project: Spark Issue Type: Bug Components: Connect, Documentation Affects Versions: 4.1.0 Reporter: Pengfei Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51666) Fix sparkStageCompleted executorRunTime metric calculation
[ https://issues.apache.org/jira/browse/SPARK-51666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-51666: -- Assignee: Weichen Xu > Fix sparkStageCompleted executorRunTime metric calculation > --- > > Key: SPARK-51666 > URL: https://issues.apache.org/jira/browse/SPARK-51666 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Labels: pull-request-available > > Fix sparkStageCompleted executorRunTime metric calculation: > When a Spark task uses multiple CPUs, the CPU-seconds metric should capture > the total execution time across all of its CPUs. For example, if a stage sets > its CPUs-per-task to 48 and a task runs for 10 seconds on each CPU, that stage > should record 10 seconds X 1 task X 48 CPUs = 480 CPU-seconds. If another task > uses only 1 CPU, its total is 10 seconds X 1 CPU = 10 CPU-seconds. > *This is an important fix because Spark supports stage-level scheduling (so > that tasks in different stages can be configured with different numbers of > CPUs); without it, the metric calculation is wrong whenever stage-level > scheduling is used.* -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
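As a quick check of the arithmetic in the description above, here is a minimal Scala sketch (a hypothetical helper, not the actual change in SPARK-51666) of how per-stage CPU-seconds should account for the number of CPUs assigned to each task:

{code:scala}
// Hypothetical helper illustrating the intended accounting only.
def stageCpuSeconds(runTimeSecondsPerCpu: Double, numTasks: Int, cpusPerTask: Int): Double =
  runTimeSecondsPerCpu * numTasks * cpusPerTask

// A task running 10 seconds with 48 CPUs contributes 480 CPU-seconds:
assert(stageCpuSeconds(10, numTasks = 1, cpusPerTask = 48) == 480.0)
// A task running 10 seconds on a single CPU contributes 10 CPU-seconds:
assert(stageCpuSeconds(10, numTasks = 1, cpusPerTask = 1) == 10.0)
{code}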
[jira] [Assigned] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-51537: Assignee: Bobby Wang > Failed to run third-party Spark ML library on Spark Connect > > > Key: SPARK-51537 > URL: https://issues.apache.org/jira/browse/SPARK-51537 > Project: Spark > Issue Type: Bug > Components: Connect, ML >Affects Versions: 4.0.0 >Reporter: Bobby Wang >Assignee: Bobby Wang >Priority: Major > Labels: pull-request-available > > I've encountered an issue where the third-party Spark ML library may not run > on Spark Connect. This problem occurs when specifying the > third-party ML jar using the *--jars* configuration while creating a connect > server > based on a Spark standalone cluster. > > The exception thrown is a ClassCastException: > > _Caused by: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance > of org.apache.spark.rdd.MapPartitionsRDD_ > _at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_ > > However, if I place the ML jar into the *$SPARK_HOME/jars* directory and > restart both the Spark standalone cluster and the Spark Connect server, it > runs without any exceptions. > > Alternatively, adding > *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in > the python code also resolves the issue. > > I have made a minimum project which can repro this issue, more details could > be found at [https://github.com/wbo4958/ConnectMLIssue] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
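For readers hitting the same SerializedLambda ClassCastException, below is a minimal sketch of the client-side workaround mentioned in the description, written with the Spark Connect Scala client; the description itself shows the PySpark form (spark.addArtifacts), and the jar path and endpoint here are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

object ConnectMlWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .remote("sc://localhost")  // placeholder Connect endpoint
      .getOrCreate()

    // Ship the third-party ML jar to the Connect server as a session artifact,
    // instead of relying on --jars when the server is started (the failing setup).
    spark.addArtifact("target/com.example.ml-1.0-SNAPSHOT.jar")

    // ... use the third-party ML library through this session as usual ...

    spark.stop()
  }
}
{code}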
[jira] [Resolved] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-51537. -- Fix Version/s: 4.1.0 Resolution: Fixed Issue resolved by pull request 50334 [https://github.com/apache/spark/pull/50334] > Failed to run third-party Spark ML library on Spark Connect > > > Key: SPARK-51537 > URL: https://issues.apache.org/jira/browse/SPARK-51537 > Project: Spark > Issue Type: Bug > Components: Connect, ML >Affects Versions: 4.0.0 >Reporter: Bobby Wang >Assignee: Bobby Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > > I've encountered an issue where the third-party Spark ML library may not run > on Spark Connect. This problem occurs when specifying the > third-party ML jar using the *--jars* configuration while creating a connect > server > based on a Spark standalone cluster. > > The exception thrown is a ClassCastException: > > _Caused by: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance > of org.apache.spark.rdd.MapPartitionsRDD_ > _at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_ > > However, if I place the ML jar into the *$SPARK_HOME/jars* directory and > restart both the Spark standalone cluster and the Spark Connect server, it > runs without any exceptions. > > Alternatively, adding > *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in > the python code also resolves the issue. > > I have made a minimum project which can repro this issue, more details could > be found at [https://github.com/wbo4958/ConnectMLIssue] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51537) Failed to run third-party Spark ML library on Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-51537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-51537: - Fix Version/s: 4.0.0 (was: 4.1.0) > Failed to run third-party Spark ML library on Spark Connect > > > Key: SPARK-51537 > URL: https://issues.apache.org/jira/browse/SPARK-51537 > Project: Spark > Issue Type: Bug > Components: Connect, ML >Affects Versions: 4.0.0 >Reporter: Bobby Wang >Assignee: Bobby Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > I've encountered an issue where the third-party Spark ML library may not run > on Spark Connect. This problem occurs when specifying the > third-party ML jar using the *--jars* configuration while creating a connect > server > based on a Spark standalone cluster. > > The exception thrown is a ClassCastException: > > _Caused by: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance > of org.apache.spark.rdd.MapPartitionsRDD_ > _at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)_ > > However, if I place the ML jar into the *$SPARK_HOME/jars* directory and > restart both the Spark standalone cluster and the Spark Connect server, it > runs without any exceptions. > > Alternatively, adding > *spark.addArtifacts("target/com.example.ml-1.0-SNAPSHOT.jar")* directly in > the python code also resolves the issue. > > I have made a minimum project which can repro this issue, more details could > be found at [https://github.com/wbo4958/ConnectMLIssue] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51667) [TWS + Python] Disable Nagle's algorithm between Python worker and State Server
[ https://issues.apache.org/jira/browse/SPARK-51667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-51667: Assignee: Jungtaek Lim > [TWS + Python] Disable Nagle's algorithm between Python worker and State > Server > --- > > Key: SPARK-51667 > URL: https://issues.apache.org/jira/browse/SPARK-51667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0, 4.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Labels: pull-request-available > > While testing TWS + Python, we found cases where the socket communication for > state interaction was delayed by more than 40ms for certain kinds of state > operations, e.g. ListState.put(), ListState.get(), ListState.appendList(), etc. > The root cause is the combination of Nagle's algorithm and delayed ACK. The > sequence is as follows: > # Python worker sends the proto message to the JVM and flushes the socket. > # Python worker then sends the follow-up data to the JVM and flushes > the socket. > # JVM reads the proto message and realizes there is follow-up data. > # JVM reads the follow-up data. > # JVM processes the request and sends the response back to the Python worker. > Due to delayed ACK, the ACK is not sent back from the JVM to the Python worker > even after step 3; the JVM's TCP stack waits for outgoing data or further ACKs > to piggyback on, but the JVM is not going to send any data during that phase. > Due to Nagle's algorithm, the message from step 2 is not sent to the JVM because > there is no ACK yet for the message from step 1. > This deadlock is resolved only after the delayed-ACK timeout, which is 40ms > (the minimum duration) on Linux. After the timeout, the ACK is sent back from > the JVM to the Python worker, and Nagle's algorithm finally allows the message > from step 2 to be sent. > See the articles below for a more general explanation: > * [https://engineering.avast.io/40-millisecond-bug/] > ** Start reading from the Nagle's algorithm section > * [https://brooker.co.za/blog/2024/05/09/nagle.html] > Nagle's algorithm reduces the number of small packets which, as the article > above notes, can help keep routers from being overloaded; here, however, we > connect to "localhost", so that benefit does not apply. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
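For context on the fix, disabling Nagle's algorithm is a per-socket option (TCP_NODELAY). A minimal sketch of the JVM (State Server) side is below; the Python worker side would set the equivalent option via setsockopt, and the port here is a placeholder.

{code:scala}
import java.net.ServerSocket

// Accept a connection and disable Nagle's algorithm on it, so small writes
// are sent immediately instead of waiting for the peer's ACK.
val server = new ServerSocket(0)  // placeholder: ephemeral port
val conn = server.accept()
conn.setTcpNoDelay(true)          // TCP_NODELAY
{code}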
[jira] [Resolved] (SPARK-51676) Support `printSchema` for `DataFrame`
[ https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-51676. --- Fix Version/s: connect-swift-0.1.0 Resolution: Fixed Issue resolved by pull request 35 [https://github.com/apache/spark-connect-swift/pull/35] > Support `printSchema` for `DataFrame` > - > > Key: SPARK-51676 > URL: https://issues.apache.org/jira/browse/SPARK-51676 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: connect-swift-0.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: connect-swift-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51676) Support `printSchema` for `DataFrame`
[ https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-51676: - Assignee: Dongjoon Hyun > Support `printSchema` for `DataFrame` > - > > Key: SPARK-51676 > URL: https://issues.apache.org/jira/browse/SPARK-51676 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: connect-swift-0.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51637) Python Data Sources columnar read
[ https://issues.apache.org/jira/browse/SPARK-51637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51637: --- Labels: pull-request-available (was: ) > Python Data Sources columnar read > - > > Key: SPARK-51637 > URL: https://issues.apache.org/jira/browse/SPARK-51637 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.1.0 >Reporter: Haoyu Weng >Priority: Major > Labels: pull-request-available > > Use PartitionReader[ColumnarBatch] on the Scala side for improved performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
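For orientation, the Scala-side hook the ticket refers to is the Data Source V2 PartitionReaderFactory, which can opt into columnar reads and hand back ColumnarBatch readers. The sketch below is a hypothetical skeleton of that shape, not the Python Data Source implementation itself:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical factory: declares columnar support and returns a ColumnarBatch reader.
class ColumnarReaderFactorySketch extends PartitionReaderFactory {
  override def supportColumnarReads(partition: InputPartition): Boolean = true

  // Row-based path is unused in this sketch.
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("columnar reads only in this sketch")

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] =
    new PartitionReader[ColumnarBatch] {
      private val batches: Iterator[ColumnarBatch] = Iterator.empty // placeholder batch source
      override def next(): Boolean = batches.hasNext
      override def get(): ColumnarBatch = batches.next()
      override def close(): Unit = ()
    }
}
{code}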
[jira] [Created] (SPARK-51676) Support `printSchema` for `DataFrame`
Dongjoon Hyun created SPARK-51676: - Summary: Support `printSchema` for `DataFrame` Key: SPARK-51676 URL: https://issues.apache.org/jira/browse/SPARK-51676 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: connect-swift-0.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51676) Support `printSchema` for `DataFrame`
[ https://issues.apache.org/jira/browse/SPARK-51676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51676: --- Labels: pull-request-available (was: ) > Support `printSchema` for `DataFrame` > - > > Key: SPARK-51676 > URL: https://issues.apache.org/jira/browse/SPARK-51676 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: connect-swift-0.1.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website
[ https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pengfei Xu updated SPARK-51674: --- Description: We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be visible in the main Spark API doc. Therefore a separate doc for Spark Connect is not necessary. e should revert SPARK-51288. > Remove unnecessary Spark Connect doc link from Spark website > > > Key: SPARK-51674 > URL: https://issues.apache.org/jira/browse/SPARK-51674 > Project: Spark > Issue Type: Bug > Components: Connect, Documentation >Affects Versions: 4.1.0 >Reporter: Pengfei Xu >Priority: Major > > We now have all Classic-only APIs annotated with @ClassicOnly. This tag will > be visible in the main Spark API doc. Therefore a separate doc for Spark > Connect is not necessary. > e should revert SPARK-51288. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website
[ https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pengfei Xu updated SPARK-51674: --- Description: We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be visible in the main Spark API doc. Therefore a separate doc for Spark Connect is not necessary. We should revert SPARK-51288. was: We now have all Classic-only APIs annotated with @ClassicOnly. This tag will be visible in the main Spark API doc. Therefore a separate doc for Spark Connect is not necessary. e should revert SPARK-51288. > Remove unnecessary Spark Connect doc link from Spark website > > > Key: SPARK-51674 > URL: https://issues.apache.org/jira/browse/SPARK-51674 > Project: Spark > Issue Type: Bug > Components: Connect, Documentation >Affects Versions: 4.1.0 >Reporter: Pengfei Xu >Priority: Major > > We now have all Classic-only APIs annotated with @ClassicOnly. This tag will > be visible in the main Spark API doc. Therefore a separate doc for Spark > Connect is not necessary. > We should revert SPARK-51288. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-50131) Update isin to accept DataFrame to work as IN subquery
[ https://issues.apache.org/jira/browse/SPARK-50131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-50131: --- Labels: pull-request-available (was: ) > Update isin to accept DataFrame to work as IN subquery > -- > > Key: SPARK-50131 > URL: https://issues.apache.org/jira/browse/SPARK-50131 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
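Until the new isin(DataFrame) overload described above is available, the IN-subquery semantics can already be expressed with a left-semi join (or a SQL IN subquery). A minimal sketch with hypothetical data and column names:

{code:scala}
import org.apache.spark.sql.SparkSession

object InSubquerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: keep only orders whose customer_id appears in `customers`.
    val orders = Seq((1, "book"), (2, "pen"), (3, "mug")).toDF("customer_id", "item")
    val customers = Seq(1, 3).toDF("customer_id")

    // Equivalent of: SELECT * FROM orders WHERE customer_id IN (SELECT customer_id FROM customers)
    orders.join(customers, Seq("customer_id"), "left_semi").show()

    spark.stop()
  }
}
{code}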
[jira] [Assigned] (SPARK-51657) UTF8_BINARY default table collation shown by default in Desc As JSON (v1)
[ https://issues.apache.org/jira/browse/SPARK-51657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-51657: -- Assignee: Amanda Liu > UTF8_BINARY default table collation shown by default in Desc As JSON (v1) > - > > Key: SPARK-51657 > URL: https://issues.apache.org/jira/browse/SPARK-51657 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51657) UTF8_BINARY default table collation shown by default in Desc As JSON (v1)
[ https://issues.apache.org/jira/browse/SPARK-51657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-51657. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 50451 [https://github.com/apache/spark/pull/50451] > UTF8_BINARY default table collation shown by default in Desc As JSON (v1) > - > > Key: SPARK-51657 > URL: https://issues.apache.org/jira/browse/SPARK-51657 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer
[ https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-51670. - Fix Version/s: 4.1.0 Resolution: Fixed Issue resolved by pull request 50465 [https://github.com/apache/spark/pull/50465] > Refactor Intersect and Except to follow Union example to reuse in single-pass > Analyzer > -- > > Key: SPARK-51670 > URL: https://issues.apache.org/jira/browse/SPARK-51670 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.1.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-51650) Support delete ml cached objects in batch
[ https://issues.apache.org/jira/browse/SPARK-51650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-51650: - Assignee: Ruifeng Zheng > Support delete ml cached objects in batch > - > > Key: SPARK-51650 > URL: https://issues.apache.org/jira/browse/SPARK-51650 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51674) Remove unnecessary Spark Connect doc link from Spark website
[ https://issues.apache.org/jira/browse/SPARK-51674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51674: --- Labels: pull-request-available (was: ) > Remove unnecessary Spark Connect doc link from Spark website > > > Key: SPARK-51674 > URL: https://issues.apache.org/jira/browse/SPARK-51674 > Project: Spark > Issue Type: Bug > Components: Connect, Documentation >Affects Versions: 4.1.0 >Reporter: Pengfei Xu >Priority: Major > Labels: pull-request-available > > We now have all Classic-only APIs annotated with @ClassicOnly. This tag will > be visible in the main Spark API doc. Therefore a separate doc for Spark > Connect is not necessary. > We should revert SPARK-51288. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51675) Fix issue around creating col family after db open
Anish Shrigondekar created SPARK-51675: -- Summary: Fix issue around creating col family after db open Key: SPARK-51675 URL: https://issues.apache.org/jira/browse/SPARK-51675 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Anish Shrigondekar Fix issue around creating col family after db open -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51668) Collect metrics for data source V2 in case the command fails
Jan-Ole Sasse created SPARK-51668: - Summary: Collect metrics for data source V2 in case the command fails Key: SPARK-51668 URL: https://issues.apache.org/jira/browse/SPARK-51668 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Jan-Ole Sasse -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33598) Support Java Class with circular references
[ https://issues.apache.org/jira/browse/SPARK-33598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939572#comment-17939572 ] Tim Robertson commented on SPARK-33598: --- Chiming in just to say I've just encountered this while porting some code from Avro to POJOs in a Spark 3.4.x environment. Bean A has references to Bean B, which in turn has a field of type Bean A. > Support Java Class with circular references > --- > > Key: SPARK-33598 > URL: https://issues.apache.org/jira/browse/SPARK-33598 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 3.1.2 >Reporter: jacklzg >Priority: Minor > > If the target Java data class has a circular reference, Spark will fail fast > from creating the Dataset or running Encoders. > > For example, with protobuf class, there is a reference with Descriptor, there > is no way to build a dataset from the protobuf class. > From this line > {color:#7a869a}Encoders.bean(ProtoBuffOuterClass.ProtoBuff.class);{color} > > It will throw out immediately > > {quote}Exception in thread "main" java.lang.UnsupportedOperationException: > Cannot have circular references in bean class, but got the circular reference > of class class com.google.protobuf.Descriptors$Descriptor > {quote} > > Can we add a parameter, for example, > > {code:java} > Encoders.bean(Class clas, List fieldsToIgnore);{code} > > or > > {code:java} > Encoders.bean(Class clas, boolean skipCircularRefField);{code} > > which subsequently, instead of throwing an exception @ > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L556], > it instead skip the field. > > {code:java} > if (seenTypeSet.contains(t)) { > if(skipCircularRefField) > println("field skipped") //just skip this field > else throw new UnsupportedOperationException( s"cannot have circular > references in class, but got the circular reference of class $t") > } > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
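To make the circular-reference failure concrete, here is a minimal Scala sketch with two hypothetical bean classes mirroring the Bean A / Bean B case in the comment above; Encoders.bean rejects it with the UnsupportedOperationException quoted in the description.

{code:scala}
import org.apache.spark.sql.Encoders

// Hypothetical beans: A holds a B, and B holds an A, forming a cycle.
class BeanA {
  private var b: BeanB = _
  def getB: BeanB = b
  def setB(v: BeanB): Unit = { b = v }
}

class BeanB {
  private var a: BeanA = _
  def getA: BeanA = a
  def setA(v: BeanA): Unit = { a = v }
}

object CircularBeanSketch {
  def main(args: Array[String]): Unit = {
    // Throws UnsupportedOperationException:
    // "Cannot have circular references in bean class, but got the circular reference of class ..."
    Encoders.bean(classOf[BeanA])
  }
}
{code}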
[jira] [Updated] (SPARK-51672) Regenerate golden files with collation aliases in branch-4.0
[ https://issues.apache.org/jira/browse/SPARK-51672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51672: --- Labels: pull-request-available (was: ) > Regenerate golden files with collation aliases in branch-4.0 > > > Key: SPARK-51672 > URL: https://issues.apache.org/jira/browse/SPARK-51672 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Priority: Major > Labels: pull-request-available > > Regenerate golden files with collation aliases in branch-4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51669) Generate random TIME values in tests
[ https://issues.apache.org/jira/browse/SPARK-51669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51669: --- Labels: pull-request-available (was: ) > Generate random TIME values in tests > > > Key: SPARK-51669 > URL: https://issues.apache.org/jira/browse/SPARK-51669 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.1.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > Extend RandomDataGenerator to support the new data type TIME. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51669) Generate random TIME values in tests
[ https://issues.apache.org/jira/browse/SPARK-51669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-51669. -- Fix Version/s: 4.1.0 Resolution: Fixed Issue resolved by pull request 50462 [https://github.com/apache/spark/pull/50462] > Generate random TIME values in tests > > > Key: SPARK-51669 > URL: https://issues.apache.org/jira/browse/SPARK-51669 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.1.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > > Extend RandomDataGenerator to support the new data type TIME. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51664) Support the TIME data type in the Hash expression
[ https://issues.apache.org/jira/browse/SPARK-51664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-51664. -- Fix Version/s: 4.1.0 Resolution: Fixed Issue resolved by pull request 50456 [https://github.com/apache/spark/pull/50456] > Support the TIME data type in the Hash expression > - > > Key: SPARK-51664 > URL: https://issues.apache.org/jira/browse/SPARK-51664 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.1.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > > Modify the HashExpression expression to support the TIME data type. In > particular, the following classes are affected: > - Murmur3Hash: generating partition ID, > - HiveHash: bucking > - XxHash64: Bloom filter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51669) Generate random TIME values in tests
Max Gekk created SPARK-51669: Summary: Generate random TIME values in tests Key: SPARK-51669 URL: https://issues.apache.org/jira/browse/SPARK-51669 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.1.0 Reporter: Max Gekk Assignee: Max Gekk Extend RandomDataGenerator to support the new data type TIME. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-51663) Short circuit && operation for JoinSelectionHelper
[ https://issues.apache.org/jira/browse/SPARK-51663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng resolved SPARK-51663. Fix Version/s: 4.1.0 Resolution: Fixed > Short circuit && operation for JoinSelectionHelper > -- > > Key: SPARK-51663 > URL: https://issues.apache.org/jira/browse/SPARK-51663 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.1.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36600) Optimise speed and memory with Pyspark when create DataFrame (with patch)
[ https://issues.apache.org/jira/browse/SPARK-36600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-36600: --- Labels: easyfix pull-request-available (was: easyfix) > Optimise speed and memory with Pyspark when create DataFrame (with patch) > - > > Key: SPARK-36600 > URL: https://issues.apache.org/jira/browse/SPARK-36600 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Philippe Prados >Priority: Trivial > Labels: easyfix, pull-request-available > Attachments: optimize_memory_pyspark.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > The Python method {{SparkSession._createFromLocal()}} start to the data, and > create a list if it's not an instance of list. But it is necessary only if > the scheme is not present. > {quote}# make sure data could consumed multiple times > if not isinstance(data, list): > data = list(data) > {quote} > If you use {{createDataFrame(data=_a_generator_,...)}}, all the datas were > save in memory in a list, then convert to a row in memory, then convert to > buffer in pickle format, etc. > Two lists were present at the same time in memory. The list created by > _createFromLocal() and the list created later with > {quote}# convert python objects to sql data > data = [schema.toInternal(row) for row in data] > {quote} > The purpose of using a generator is to reduce the memory footprint when the > data are dynamically build. > {quote}def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from a list or pandas.DataFrame, returns > the RDD and schema. > """ > if schema is None or isinstance(schema, (list, tuple)): > *# make sure data could consumed multiple times* > *if inspect.isgeneratorfunction(data):* > *data = list(data)* > struct = self._inferSchemaFromList(data, names=schema) > converter = _create_converter(struct) > data = map(converter, data) > if isinstance(schema, (list, tuple)): > for i, name in enumerate(schema): > struct.fields[i].name = name > struct.names[i] = name > schema = struct > elif not isinstance(schema, StructType): > raise TypeError("schema should be StructType or list or None, but got: > %s" % schema) > # convert python objects to sql data > data = [schema.toInternal(row) for row in data] > return self._sc.parallelize(data), schema{quote} > Then, it is interesting to use a generator. > > {quote}The patch: > diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py > index 57c680fd04..0dba590451 100644 > --- a/python/pyspark/sql/session.py > +++ b/python/pyspark/sql/session.py > @@ -15,6 +15,7 @@ > # limitations under the License. > # > > +import inspect > import sys > import warnings > from functools import reduce > @@ -504,11 +505,11 @@ class SparkSession(SparkConversionMixin): > Create an RDD for DataFrame from a list or pandas.DataFrame, returns > the RDD and schema. > """ > - # make sure data could consumed multiple times > - if not isinstance(data, list): > - data = list(data) > > if schema is None or isinstance(schema, (list, tuple)): > + # make sure data could consumed multiple times > + if inspect.isgeneratorfunction(data): # PPR > + data = list(data) > struct = self._inferSchemaFromList(data, names=schema) > converter = _create_converter(struct) > data = map(converter, data) > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51671) Add column pruning to Recursive CTEs
Pavle Martinović created SPARK-51671: Summary: Add column pruning to Recursive CTEs Key: SPARK-51671 URL: https://issues.apache.org/jira/browse/SPARK-51671 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.1.0 Reporter: Pavle Martinović -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-51673) Apply default collation for Alter view query
Marko Ilic created SPARK-51673: -- Summary: Apply default collation for Alter view query Key: SPARK-51673 URL: https://issues.apache.org/jira/browse/SPARK-51673 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 4.0.0 Reporter: Marko Ilic Default collation is not applied when alter view is done. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51673) Apply default collation for Alter view query
[ https://issues.apache.org/jira/browse/SPARK-51673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51673: --- Labels: pull-request-available (was: ) > Apply default collation for Alter view query > > > Key: SPARK-51673 > URL: https://issues.apache.org/jira/browse/SPARK-51673 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Marko Ilic >Priority: Major > Labels: pull-request-available > > Default collation is not applied when alter view is done. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests
[ https://issues.apache.org/jira/browse/SPARK-30451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939692#comment-17939692 ] Jayadeep Jayaraman commented on SPARK-30451: Hi [~tgraves] - I would like to work on this request. As per the description my understanding is {code:java} class ExecutorResourceRequests() {code} is a convenience method for the class {code:java} class ExecutorResourceRequest(){code} Class `ExecutorResourceRequests() ` has multiple methods to add different resources {code:java} def offHeapMemory(amount: String): this.type def memoryOverhead(amount: String): this.type def memory(amount: String): this.type{code} and they are stored in a mutable map {code:java} new ConcurrentHashMap[String, ExecutorResourceRequest](){code} The goal of this PR is to offer an API that allows users to build a Request profile once as shown below {code:java} val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .memory("1g") .cores(4) .resource("gpu", 2) .build() {code} and then offer ways to add or remove specific resources as shown below {code:java} val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .offHeapMemory("512m") .add(new ExecutorResourceRequest("fpga", 1)) .remove("memory") // This would have no effect as 'memory' wasn't added in this builder .build() {code} Kindly, let me know if my understanding is correct and I would like to work on this task > Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have > functions to remove requests > - > > Key: SPARK-30451 > URL: https://issues.apache.org/jira/browse/SPARK-30451 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have > functions to remove requests > Currently in the design ExecutorResourceRequests and TaskREsourceRequests are > mutable and users can update as they want. It would make sense to add api's > to remove certain resource requirements from them. This would allow a user to > create one ExecutorResourceRequests object and then if they want to just > add/remove something from it they easily could without having to recreate all > the requests in that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests
[ https://issues.apache.org/jira/browse/SPARK-30451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939692#comment-17939692 ] Jayadeep Jayaraman edited comment on SPARK-30451 at 3/31/25 1:23 PM: - Hi [~tgraves] - I would like to work on this request. As per the description my understanding is {code:java} class ExecutorResourceRequests() {code} is a convenience method for the class {code:java} class ExecutorResourceRequest(){code} Class `ExecutorResourceRequests() ` has multiple methods to add different resources {code:java} def offHeapMemory(amount: String): this.type def memoryOverhead(amount: String): this.type def memory(amount: String): this.type{code} and they are stored in a mutable map {code:java} new ConcurrentHashMap[String, ExecutorResourceRequest](){code} The goal of this PR is to offer an API that allows users to build a Request profile once as shown below {code:java} val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .memory("1g") .cores(4) .resource("gpu", 2) .build() {code} and then offer ways to add or remove specific resources as shown below {code:java} val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .offHeapMemory("512m") .add(new ExecutorResourceRequest("fpga", 1)) .remove("memory") // This would have no effect as 'memory' wasn't added in this builder .build() {code} Kindly, let me know if my understanding is correct and I would like to work on this task was (Author: jjayadeep): Hi [~tgraves] - I would like to work on this request. As per the description my understanding is {code:java} class ExecutorResourceRequests() {code} is a convenience method for the class {code:java} class ExecutorResourceRequest(){code} Class `ExecutorResourceRequests() ` has multiple methods to add different resources {code:java} def offHeapMemory(amount: String): this.type def memoryOverhead(amount: String): this.type def memory(amount: String): this.type{code} and they are stored in a mutable map {code:java} new ConcurrentHashMap[String, ExecutorResourceRequest](){code} The goal of this PR is to offer an API that allows users to build a Request profile once as shown below {code:java} val executorReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .memory("1g") .cores(4) .resource("gpu", 2) .build() {code} and then offer ways to add or remove specific resources as shown below {code:java} val anotherReqs: ExecutorResourceRequests = ExecutorResourceRequests.builder() .offHeapMemory("512m") .add(new ExecutorResourceRequest("fpga", 1)) .remove("memory") // This would have no effect as 'memory' wasn't added in this builder .build() {code} Kindly, let me know if my understanding is correct and I would like to work on this task > Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have > functions to remove requests > - > > Key: SPARK-30451 > URL: https://issues.apache.org/jira/browse/SPARK-30451 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have > functions to remove requests > Currently in the design ExecutorResourceRequests and TaskREsourceRequests are > mutable and users can update as they want. It would make sense to add api's > to remove certain resource requirements from them. 
This would allow a user to > create one ExecutorResourceRequests object and then if they want to just > add/remove something from it they easily could without having to recreate all > the requests in that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-51670) Refactor Intersect and Except to follow Union example to reuse in single-pass Analyzer
[ https://issues.apache.org/jira/browse/SPARK-51670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-51670: --- Labels: pull-request-available (was: ) > Refactor Intersect and Except to follow Union example to reuse in single-pass > Analyzer > -- > > Key: SPARK-51670 > URL: https://issues.apache.org/jira/browse/SPARK-51670 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.1.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org