date:20250224

Re: [PR] [SPARK-51305][SQL][CONNECT] Improve `SparkConnectPlanExecution.createObservedMetricsResponse` [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50066:
URL: https://github.com/apache/spark/pull/50066#issuecomment-2680591378

   @dongjoon-hyun Thank you !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]

2025-02-24 Thread via GitHub



cloud-fan commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1968959548


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala:
##
@@ -841,20 +841,24 @@ class DataFrameWriterV2Suite extends QueryTest with 
SharedSparkSession with Befo
   }
 
   test("SPARK-51281: create/replace file source tables") {
+def checkResults(df: DataFrame): Unit = {
+  checkAnswer(df, spark.range(10).toDF())
+}

Review Comment:
   what do you mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker [spark]

2025-02-24 Thread via GitHub



HeartSaVioR closed pull request #50015: [SPARK-51265][SQL][SS] Throw proper 
error for eagerlyExecuteCommands containing streaming source marker
URL: https://github.com/apache/spark/pull/50015


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker [spark]

2025-02-24 Thread via GitHub



HeartSaVioR commented on PR #50015:
URL: https://github.com/apache/spark/pull/50015#issuecomment-2680891589

   Closing via #50037 - much simpler change and both of PRs do not address the 
origin report which @cloud-fan will address later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51187][SQL][SS] Implement the graceful deprecation of incorrect config introduced in SPARK-49699 [spark]

2025-02-24 Thread via GitHub



HeartSaVioR commented on PR #49983:
URL: https://github.com/apache/spark/pull/49983#issuecomment-2680925857

   @cloud-fan 
   
   > have we merged this graceful deprecation in branch 3.5?
   
   Yes, that is merged. It's still a blocker for Spark 4.0.0 though.
   
   @dongjoon-hyun 
   
   > If you don't mind, I'd like to propose to wait until we finish Apache 
Spark 3.5.5 release.
   
   Would you mind explaining why this has to be coupled with Spark 3.5.5, as 
discussion for 4.1/4.0 can always happen in parallel?
   
   I sincerely disagree that the option 2 is viable for us. This builds a huge 
limitation of SS users for Spark 3.5.4 on upgrading (they have to touch Spark 
3.5.x before upgrading to 4.0.0), to achieve our goal of removing the "string" 
of incorrect config immediately. Again, this is just a string because it's only 
for migration logic and users won't be able to set the config manually. Why not 
just be more open for Spark 3.5.4 users to have a range of upgrade path?
   
   If you'd like to hear more voices, I'm happy to post the discussion to dev@. 
I just don't want this change to be coupled with non-relevant release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1969188058


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1410,83 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")
+configureMaxThriftMessageSize(hiveConf, hive.getMSC, maxMessageSize)
+  }
+
+hive
+  }
+
+  private def getFieldValue[T](obj: Any, fieldName: String): T = {
+val field = obj.getClass.getDeclaredField(fieldName)
+field.setAccessible(true)
+field.get(obj).asInstanceOf[T]
+  }
+
+  private def getFieldValue[T](obj: Any, clazz: Class[_], fieldName: String): 
T = {
+val field = clazz.getDeclaredField(fieldName)
+field.setAccessible(true)
+field.get(obj).asInstanceOf[T]
+  }
+
+  // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633
+  private def configureMaxThriftMessageSize(
+  hiveConf: HiveConf, msClient: IMetaStoreClient, maxMessageSize: Int): 
Unit = try {
+msClient match {
+  // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to 
support synchronization
+  // and retrying, we should unwrap and access the underlying 
MetaStoreClient instance firstly
+  case proxy if JdkProxy.isProxyClass(proxy.getClass) =>
+JdkProxy.getInvocationHandler(proxy) match {
+  case syncHandler if 
syncHandler.getClass.getName.endsWith("SynchronizedHandler") =>
+val wrappedMsc = getFieldValue[IMetaStoreClient](syncHandler, 
"client")
+configureMaxThriftMessageSize(hiveConf, wrappedMsc, maxMessageSize)
+  case retryHandler: RetryingMetaStoreClient =>
+val wrappedMsc = getFieldValue[IMetaStoreClient](retryHandler, 
"base")
+configureMaxThriftMessageSize(hiveConf, wrappedMsc, maxMessageSize)
+  case _ =>
+}
+  case msc: HiveMetaStoreClient if !msc.isLocalMetaStore =>
+@tailrec
+def configure(t: TTransport): Unit = t match {
+  // Unwrap and access the underlying TTransport when security enabled 
(Kerberos)
+  case tTransport: TFilterTransport =>

Review Comment:
   @@Madhukar525722 kerberos cases are addressed here



##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1410,83 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")
+configureMaxThriftMessageSize(hiveConf, hive.getMSC, maxMessageSize)
+  }
+
+hive
+  }
+
+  private def getFieldValue[T](obj: Any, fieldName: String): T = {
+val field = obj.getClass.getDeclaredField(fieldName)
+field.setAccessible(true)
+field.get(obj).asInstanceOf[T]
+  }
+
+  private def getFieldValue[T](obj: Any, clazz: Class[_], fieldName: String): 
T = {
+val field = clazz.getDeclaredField(fieldName)
+field.setAccessible(true)
+field.get(obj).asInstanceOf[T]
+  }
+
+  // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633
+  private def configureMaxThriftMessageSize(
+  hiveConf: HiveConf, msClient: IMetaStoreClient, maxMessageSize: Int): 
Unit = try {
+msClient match {
+  /

Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #49560:
URL: https://github.com/apache/spark/pull/49560#discussion_r1968788349


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1034,6 +1038,49 @@ class SparkConnectPlanner(
   .logicalPlan
   }
 
+  private def transformTransformWithStateInPandas(
+  pythonUdf: PythonUDF,
+  groupedDs: RelationalGroupedDataset,
+  rel: proto.GroupMap): LogicalPlan = {
+val twsInfo = rel.getTransformWithStateInfo
+val outputSchema = parseSchema(twsInfo.getOutputSchema)
+
+if (rel.hasInitialInput) {
+  val initialGroupingCols = 
rel.getInitialGroupingExpressionsList.asScala.toSeq.map(expr =>
+Column(transformExpression(expr)))
+
+  val initialStateDs = Dataset
+.ofRows(session, transformRelation(rel.getInitialInput))

Review Comment:
   I am aware it is used in multiple places in the planner. That does not mean 
it should be encouraged. We want to get rid of most of this after Spark 4.0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]

2025-02-24 Thread via GitHub



wayneguow commented on PR #50073:
URL: https://github.com/apache/spark/pull/50073#issuecomment-2680420915

   cc @MaxGekk  @allisonwang-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50692][SQL][FOLLOWUP] Add the LPAD and RPAD pushdown support for H2 [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50068:
URL: https://github.com/apache/spark/pull/50068#issuecomment-2680314901

   > Oh, did you aim to use this as a follow-up, @beliefer ?
   
   Uh, I forgot it. I want it to be a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51308][CONNECT][BUILD] Update the relocation rules for the `connect` module in `SparkBuild.scala` to ensure that both Maven and SBT produce the assembly JAR according to the same rules [sp

2025-02-24 Thread via GitHub



wayneguow opened a new pull request, #50075:
URL: https://github.com/apache/spark/pull/50075

   
   
   ### What changes were proposed in this pull request?
   
   This PR aims to update the relocation rules for the `connect` module in 
`SparkBuild.scala`.
   
   ### Why are the changes needed?
   
   Ensure that both Maven and SBT produce the assembly JAR according to the 
same rules.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]

2025-02-24 Thread via GitHub



dongjoon-hyun commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1969044495


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -5545,6 +5545,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION =
+buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption")
+  .internal()
+  .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and 
always write data " +
+"to the default table location.")
+  .version("3.5.5")

Review Comment:
   > I'll change it to 3.5.6 after the 3.5.5 RC1 vote passes.
   
   Ya, let's wait and see the vote result because it's still open.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]

2025-02-24 Thread via GitHub



szehon-ho commented on code in PR #50040:
URL: https://github.com/apache/spark/pull/50040#discussion_r1969046534


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala:
##
@@ -839,4 +839,30 @@ class DataFrameWriterV2Suite extends QueryTest with 
SharedSparkSession with Befo
   condition = "CALL_ON_STREAMING_DATASET_UNSUPPORTED",
   parameters = Map("methodName" -> "`writeTo`"))
   }
+
+  test("SPARK-51281: create/replace file source tables") {

Review Comment:
   question: are we testing replace here?  Why not just the name of the jira?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51309][BUILD] Upgrade rocksdbjni to 9.10.0 [spark]

2025-02-24 Thread via GitHub



wayneguow commented on PR #50076:
URL: https://github.com/apache/spark/pull/50076#issuecomment-2680812582

   Related benchmark results:
   
   - jdk17: https://github.com/wayneguow/spark/actions/runs/13513028574
   - jdk21: https://github.com/wayneguow/spark/actions/runs/13513032754


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51309][BUILD] Upgrade rocksdbjni to 9.10.0 [spark]

2025-02-24 Thread via GitHub



wayneguow opened a new pull request, #50076:
URL: https://github.com/apache/spark/pull/50076

   
   
   ### What changes were proposed in this pull request?
   
   The pr aims to upgrade `rocksdbjni` from 9.8.4 to 9.10.0.
   
   ### Why are the changes needed?
   
   There are some bug fixes and performance Improvements, full release notes:
   
   - https://github.com/facebook/rocksdb/releases/tag/v9.9.3
   - https://github.com/facebook/rocksdb/releases/tag/v9.10.0
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



micheal-o commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968977014


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -419,6 +432,10 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
   private val loadedMapCacheHitCount: LongAdder = new LongAdder
   private val loadedMapCacheMissCount: LongAdder = new LongAdder
 
+  // This is updated when the maintenance task writes the snapshot file and 
read by the task

Review Comment:
   it is possible for task thread to do snapshot upload too right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]

2025-02-24 Thread via GitHub



LuciferYang commented on PR #50073:
URL: https://github.com/apache/spark/pull/50073#issuecomment-2680871903

   also cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



liviazhu-db commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968325595


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
   private val loadedMapCacheHitCount: LongAdder = new LongAdder
   private val loadedMapCacheMissCount: LongAdder = new LongAdder
 
+  // This is updated when the maintenance task writes the snapshot file, -1 
represents no version
+  // has ever been uploaded.
+  private var lastSnapshotUploadedVersion: Long = -1L

Review Comment:
   Does this not need to be an AtomicLong, since maintenance thread will modify 
it and task thread will read it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



liviazhu-db commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968325595


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
   private val loadedMapCacheHitCount: LongAdder = new LongAdder
   private val loadedMapCacheMissCount: LongAdder = new LongAdder
 
+  // This is updated when the maintenance task writes the snapshot file, -1 
represents no version
+  // has ever been uploaded.
+  private var lastSnapshotUploadedVersion: Long = -1L

Review Comment:
   Does this not need to be an AtomicLong, since maintenance thread will modify 
it and task thread will read it? Can you add this in a comment as well?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
   private val loadedMapCacheHitCount: LongAdder = new LongAdder
   private val loadedMapCacheMissCount: LongAdder = new LongAdder
 
+  // This is updated when the maintenance task writes the snapshot file, -1 
represents no version
+  // has ever been uploaded.
+  private var lastSnapshotUploadedVersion: Long = -1L

Review Comment:
   Doesn't this need to be an AtomicLong, since maintenance thread will modify 
it and task thread will read it? Can you add this in a comment as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51290][SQL] Enable filling default values in DSv2 writes [spark]

2025-02-24 Thread via GitHub



viirya commented on code in PR #50044:
URL: https://github.com/apache/spark/pull/50044#discussion_r1968152464


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##
@@ -718,6 +724,11 @@ private class BufferedRowsReader(
   schema: StructType,
   row: InternalRow): Any = {
 val index = schema.fieldIndex(field.name)
+
+if (index >= row.numFields) {

Review Comment:
   This is method `extractFieldValue`. Looks like it is only used by `get`. Why 
this is for adding columns?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]

2025-02-24 Thread via GitHub



jingz-db commented on code in PR #49560:
URL: https://github.com/apache/spark/pull/49560#discussion_r1968245505


##
sql/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -1031,6 +1031,26 @@ message GroupMap {
 
   // (Optional) The schema for the grouped state.
   optional DataType state_schema = 10;
+
+  // Below fields are used by TransformWithState and TransformWithStateInPandas
+  // (Optional) TransformWithState related parameters.
+  optional TransformWithStateInfo transform_with_state_info = 11;
+}
+
+// Additional input parameters used for TransformWithState operator.
+message TransformWithStateInfo {

Review Comment:
   > This would look better, however, it is not backward compatible.
   
   Yeah agreed... I guess it is not very safe to change the connect protocol 
for a released operator. We should probably leave FMGWS as it is in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



Madhukar525722 commented on PR #50022:
URL: https://github.com/apache/spark/pull/50022#issuecomment-2679438548

   HI @pan3793 . The flow was able to reach the case msc:
   But there I added the debug log - 
   ```
   msc.getTTransport match {
 case t: TEndpointTransport =>
   val currentMaxMessageSize = 
t.getConfiguration.getMaxMessageSize
  ...
 case _ => logDebug(s"The metastore client transport is not 
TEndpointTransport, but: ${msc.getTTransport.getClass.getName}")
   }
   ```
   
   and found corresponding value to be
   The metastore client transport is not TEndpointTransport, but: 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport
   
   Does it require, more unwrapping before, getting the exact TEndpointTransport


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



liviazhu-db commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968396281


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
 supportedCustomMetrics.find(_.name == name).map(_ -> value)
   } + (metricStateOnCurrentVersionSizeBytes -> 
SizeEstimator.estimate(mapToUpdate))
 
-  StateStoreMetrics(mapToUpdate.size(), 
metricsFromProvider("memoryUsedBytes"), customMetrics)
+  val instanceMetrics = Map(
+instanceMetricSnapshotLastUpload.withNewId(

Review Comment:
   Oh I see, makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



liviazhu-db commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968332195


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
 supportedCustomMetrics.find(_.name == name).map(_ -> value)
   } + (metricStateOnCurrentVersionSizeBytes -> 
SizeEstimator.estimate(mapToUpdate))
 
-  StateStoreMetrics(mapToUpdate.size(), 
metricsFromProvider("memoryUsedBytes"), customMetrics)
+  val instanceMetrics = Map(
+instanceMetricSnapshotLastUpload.withNewId(

Review Comment:
   Why do we do withNewId here and not assign these values when we create 
`instanceMetricSnapshotLastUpload` on line 452?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



zecookiez commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968378429


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
 supportedCustomMetrics.find(_.name == name).map(_ -> value)
   } + (metricStateOnCurrentVersionSizeBytes -> 
SizeEstimator.estimate(mapToUpdate))
 
-  StateStoreMetrics(mapToUpdate.size(), 
metricsFromProvider("memoryUsedBytes"), customMetrics)
+  val instanceMetrics = Map(
+instanceMetricSnapshotLastUpload.withNewId(

Review Comment:
   Good question, it looks like the main issue is this:
   
![image](https://github.com/user-attachments/assets/34ccb451-a7f1-43d4-83c6-769701c053a9)
   
   Seems like state store ID is initialized at a later point in time than the 
metrics, and the default value is null



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]

2025-02-24 Thread via GitHub



zecookiez commented on code in PR #50030:
URL: https://github.com/apache/spark/pull/50030#discussion_r1968381328


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends 
StateStoreProvider with
   private val loadedMapCacheHitCount: LongAdder = new LongAdder
   private val loadedMapCacheMissCount: LongAdder = new LongAdder
 
+  // This is updated when the maintenance task writes the snapshot file, -1 
represents no version
+  // has ever been uploaded.
+  private var lastSnapshotUploadedVersion: Long = -1L

Review Comment:
   Good catch, I initially thought this wasn't needed for HDFS but I misread 
how we're using `writeSnapshot`. Updated with comment!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]

2025-02-24 Thread via GitHub



HyukjinKwon commented on PR #50058:
URL: https://github.com/apache/spark/pull/50058#issuecomment-2679997693

   cc @dongjoon-hyun This should fix the scheduled build and make it green 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]

2025-02-24 Thread via GitHub



dongjoon-hyun commented on PR #50058:
URL: https://github.com/apache/spark/pull/50058#issuecomment-268537

   Thank you, @HyukjinKwon !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]

2025-02-24 Thread via GitHub



yaooqinn opened a new pull request, #50074:
URL: https://github.com/apache/spark/pull/50074

   ### What changes were proposed in this pull request?
   
   
   This PR uses CatalogUtils.URIToString instead of URI.toString to decode the 
location URI.
   
   
   ### Why are the changes needed?
   
   For example, for partition specs like test1=X'16', 
test3=timestamp'2018-11-17 13:33:33', the stored path will include them as 
`test1=%16/test3=2018-11-17 13%3A33%3A33` because the special characters are 
escaped. Furthermore, while resolving the whole path string to a URI object, 
this path fragment becomes `test1=%2516/test3=2018-11-17 13%253A33%253A33`, so 
we need to decode `%25` -> `%` before displaying to users
   
   ### Does this PR introduce _any_ user-facing change?
   yes, DESC TABLE  will not show 2x-encoded paths.
   
   
   ### How was this patch tested?
   new tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51304][DOCS][PYTHON] Use `getCondition` instead of `getErrorClass` in contribution guide [spark]

2025-02-24 Thread via GitHub



itholic commented on PR #50062:
URL: https://github.com/apache/spark/pull/50062#issuecomment-2680085229

   Thanks all for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]

2025-02-24 Thread via GitHub



wayneguow opened a new pull request, #50073:
URL: https://github.com/apache/spark/pull/50073

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968733126


##
mllib/src/main/scala/org/apache/spark/ml/classification/FMClassifier.scala:
##
@@ -235,6 +236,13 @@ class FMClassifier @Since("3.0.0") (
 model.setSummary(Some(summary))
   }
 
+  override def estimateModelSize(dataset: Dataset[_]): Long = {
+val numFeatures = DatasetUtils.getNumFeatures(dataset, $(featuresCol))

Review Comment:
   I am worried that executing the input query multiple times will be 
considered wasteful. I am wondering if we should do the check while fitting 
instead; we fail as soon as the model gets too large.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968734181


##
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:
##
@@ -1248,6 +1263,11 @@ class LogisticRegressionModel private[spark] (
 }
   }
 
+  private[spark] override def estimatedSize: Long = {
+SizeEstimator.estimate((this.params, this.uid,

Review Comment:
   NIT: Can we just make the SizeEstimator take varargs instead of creating a 
tuple here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968738761


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -780,6 +780,11 @@
   "Cannot retrieve  from the ML cache. It is probably 
because the entry has been evicted."
 ]
   },
+  "CACHE_ITEM_EXCEEDED" : {
+"message" : [
+  "Cannot  object , since its estimated size 
 exceeds the limitation ."

Review Comment:
   `...exceeds the maximum allowed size .` instead of `exceeds the 
limitation .`?
   
   Does this use proper size formats?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [DRAFT] Two string types [spark]

2025-02-24 Thread via GitHub



github-actions[bot] closed pull request #48861: [DRAFT] Two string types
URL: https://github.com/apache/spark/pull/48861


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968708896


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLCache.scala:
##
@@ -21,23 +21,52 @@ import java.util.concurrent.{ConcurrentMap, TimeUnit}
 
 import com.google.common.cache.CacheBuilder
 
-import org.apache.spark.internal.Logging
+import org.apache.spark.internal.{Logging, LogKeys, MDC}
+import org.apache.spark.ml.Model
 import org.apache.spark.ml.util.ConnectHelper
+import org.apache.spark.sql.classic.SparkSession
+import org.apache.spark.sql.connect.config.Connect._
+import org.apache.spark.util.SizeEstimator
 
 /**
  * MLCache is for caching ML objects, typically for models and summaries 
evaluated by a model.
  */
-private[connect] class MLCache extends Logging {
+private[connect] class MLCache(session: SparkSession) extends Logging {
   private val helper = new ConnectHelper()
   private val helperID = "__ML_CONNECT_HELPER__"
+  private def conf = session.sessionState.conf
 
-  private val cachedModel: ConcurrentMap[String, Object] = CacheBuilder
-.newBuilder()
-.softValues()
-.maximumSize(MLCache.MAX_CACHED_ITEMS)
-.expireAfterAccess(MLCache.CACHE_TIMEOUT_MINUTE, TimeUnit.MINUTES)
-.build[String, Object]()
-.asMap()
+  private val cachedModel: ConcurrentMap[String, (Object, Long)] = {
+val builder = CacheBuilder.newBuilder().softValues()
+
+val cacheWeight = conf.getConf(CONNECT_SESSION_ML_CACHE_TOTAL_ITEM_SIZE)
+val cacheSize = conf.getConf(CONNECT_SESSION_ML_CACHE_SIZE)
+val timeOut = conf.getConf(CONNECT_SESSION_ML_CACHE_TIMEOUT)
+
+if (cacheWeight > 0) {
+  builder

Review Comment:
   What is the eviction policy here? LRU?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]

2025-02-24 Thread via GitHub



yaooqinn commented on PR #50065:
URL: https://github.com/apache/spark/pull/50065#issuecomment-2680188025

   Merged to master/4.0, thank you @dongjoon-hyun @asl3 @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968704565


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala:
##
@@ -125,6 +127,15 @@ private[connect] object MLHandler extends Logging {
 val dataset = MLUtils.parseRelationProto(fitCmd.getDataset, 
sessionHolder)
 val estimator =
   MLUtils.getEstimator(sessionHolder, estimatorProto, 
Some(fitCmd.getParams))
+
+// pre-training model size check
+val maxSize = 
conf.getConf(Connect.CONNECT_SESSION_ML_CACHE_SINGLE_ITEM_SIZE)
+if (maxSize > 0) {
+  val estimatedSize = estimator.estimateModelSize(dataset)

Review Comment:
   How accurate is this? How much state is added to the model while fitting?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968705260


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLException.scala:
##
@@ -36,3 +36,17 @@ private[spark] case class 
MLCacheInvalidException(objectName: String)
   errorClass = "CONNECT_ML.CACHE_INVALID",
   messageParameters = Map("objectName" -> objectName),
   cause = null)
+
+private[spark] case class MlItemSizeExceededException(

Review Comment:
   Let's make sure the client side error mappings can deal with this new 
Exception. cc @itholic .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]

2025-02-24 Thread via GitHub



dongjoon-hyun commented on PR #50065:
URL: https://github.com/apache/spark/pull/50065#issuecomment-2680207829

   Thank you, @yaooqinn and all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]

2025-02-24 Thread via GitHub



dongjoon-hyun commented on PR #50061:
URL: https://github.com/apache/spark/pull/50061#issuecomment-2680208684

   Thank you for adding JIRA issue ID, @yaooqinn .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968711609


##
mllib/src/main/scala/org/apache/spark/ml/util/Summary.scala:
##
@@ -18,11 +18,21 @@
 package org.apache.spark.ml.util
 
 import org.apache.spark.annotation.Since
+import org.apache.spark.util.KnownSizeEstimation
 
 /**
+ * For ml connect only.
  * Trait for the Summary
  * All the summaries should extend from this Summary in order to
  * support connect.
  */
 @Since("4.0.0")
-private[spark] trait Summary
+private[spark] trait Summary extends KnownSizeEstimation {
+
+  // A summary is normally a small object, with several RDDs or DataFrame.
+  // The SizeEstimator is likely to overestimate the size of the summary,
+  // because it will also count the underlying SparkSession and/or 
SparkContext,

Review Comment:
   This is why I don't like the SizeEstimator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968717393


##
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:
##
@@ -504,6 +506,10 @@ object Vectors {
 
   /** Max number of nonzero entries used in computing hash code. */
   private[linalg] val MAX_HASH_NNZ = 128
+
+  private[ml] def getSparseSize(nnz: Long): Long = nnz * 12 + 20

Review Comment:
   Let's try to avoid hard coding constants.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51099][PYTHON][FOLLOWUP] Avoid logging when selector.select returns 0 without waiting the configured timeout [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50071: [SPARK-51099][PYTHON][FOLLOWUP] Avoid 
logging when selector.select returns 0 without waiting the configured timeout
URL: https://github.com/apache/spark/pull/50071


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51099][PYTHON][FOLLOWUP][4.0] Avoid logging when selector.select returns 0 without waiting the configured timeout [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50072: [SPARK-51099][PYTHON][FOLLOWUP][4.0] 
Avoid logging when selector.select returns 0 without waiting the configured 
timeout
URL: https://github.com/apache/spark/pull/50072


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50058: [SPARK-50914][PYTHON][CONNECT] Match 
GRPC dependencies for Python-only master scheduled job
URL: https://github.com/apache/spark/pull/50058


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]

2025-02-24 Thread via GitHub



HyukjinKwon commented on PR #50058:
URL: https://github.com/apache/spark/pull/50058#issuecomment-2680012404

   Merged to master and branch-4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50319] Reorder ResolveIdentifierClause and BindParameter rules [spark]

2025-02-24 Thread via GitHub



github-actions[bot] closed pull request #48849: [SPARK-50319] Reorder 
ResolveIdentifierClause and BindParameter rules
URL: https://github.com/apache/spark/pull/48849


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]

2025-02-24 Thread via GitHub



yaooqinn closed pull request #50061: [SPARK-51306][TESTS] Fix test errors 
caused by improper DROP TABLE/VIEW in describe.sql
URL: https://github.com/apache/spark/pull/50061


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968701252


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala:
##
@@ -313,4 +313,49 @@ object Connect {
   .internal()
   .booleanConf
   .createWithDefault(true)
+
+  val CONNECT_SESSION_ML_CACHE_TIMEOUT =
+buildConf("spark.connect.session.mlCache.timeout")

Review Comment:
   I am not sure if we can do this. This is a non-recoverable situation for a 
client. The client itself should tell us they are done with a model. I would 
suggest instead that we either fail and give them a list of models they can 
evict, or that we store old models on disk.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]

2025-02-24 Thread via GitHub



yaooqinn closed pull request #50065: [SPARK-50795][SQL][FOLLOWUP] Set isParsing 
to false for the timestamp formatter in DESCRIBE AS JSON
URL: https://github.com/apache/spark/pull/50065


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968703858


##
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala:
##
@@ -125,6 +127,15 @@ private[connect] object MLHandler extends Logging {
 val dataset = MLUtils.parseRelationProto(fitCmd.getDataset, 
sessionHolder)
 val estimator =
   MLUtils.getEstimator(sessionHolder, estimatorProto, 
Some(fitCmd.getParams))
+
+// pre-training model size check
+val maxSize = 
conf.getConf(Connect.CONNECT_SESSION_ML_CACHE_SINGLE_ITEM_SIZE)
+if (maxSize > 0) {
+  val estimatedSize = estimator.estimateModelSize(dataset)
+  if (estimatedSize > maxSize) {
+throw MlItemSizeExceededException("fit", estimator.uid, 
estimatedSize, maxSize)

Review Comment:
   No we should not stop the session. The client can technically recover by 
increasing the limits.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]

2025-02-24 Thread via GitHub



yaooqinn commented on PR #50061:
URL: https://github.com/apache/spark/pull/50061#issuecomment-2680181236

   Thank you @dongjoon-hyun @LuciferYang, SPARK-51306 is attached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50692][SQL][FOLLOWUP] Add the LPAD and RPAD pushdown support for H2 [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50068:
URL: https://github.com/apache/spark/pull/50068#issuecomment-2680315941

   @dongjoon-hyun Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50059:
URL: https://github.com/apache/spark/pull/50059#issuecomment-2680328835

   > Do you think you can add some test cases, @beliefer , to be clear what was 
the problem and to prevent a future regression?
   
   Spark Connect already have the test cases. This improvement is just to unify 
the code path and improve the maintenance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #49560:
URL: https://github.com/apache/spark/pull/49560#discussion_r1968783156


##
sql/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -1031,6 +1031,26 @@ message GroupMap {
 
   // (Optional) The schema for the grouped state.
   optional DataType state_schema = 10;
+
+  // Below fields are used by TransformWithState and TransformWithStateInPandas
+  // (Optional) TransformWithState related parameters.
+  optional TransformWithStateInfo transform_with_state_info = 11;
+}
+
+// Additional input parameters used for TransformWithState operator.
+message TransformWithStateInfo {
+  // (Required) Time mode string for transformWithState.
+  string time_mode = 1;

Review Comment:
   I can't find that comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]

2025-02-24 Thread via GitHub



hvanhovell commented on code in PR #49560:
URL: https://github.com/apache/spark/pull/49560#discussion_r1968784789


##
sql/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -1031,6 +1031,26 @@ message GroupMap {
 
   // (Optional) The schema for the grouped state.
   optional DataType state_schema = 10;
+
+  // Below fields are used by TransformWithState and TransformWithStateInPandas
+  // (Optional) TransformWithState related parameters.
+  optional TransformWithStateInfo transform_with_state_info = 11;
+}
+
+// Additional input parameters used for TransformWithState operator.
+message TransformWithStateInfo {

Review Comment:
   @haiyangsun-db you could make it compatible by retaining the original 
fields. The only problem is that now you have two codepaths in the planner to 
maintain.
   
   @jingz-db we only have to be backwards compatible (older client should work 
on a newer service). As long as we make additive changes we should be good. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]

2025-02-24 Thread via GitHub



HyukjinKwon commented on PR #50056:
URL: https://github.com/apache/spark/pull/50056#issuecomment-2677664899

   Merged to master and branch-4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50056: [SPARK-51300][PS][DOCS] Fix broken link 
for `ps.sql`
URL: https://github.com/apache/spark/pull/50056


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [WIP][SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]

2025-02-24 Thread via GitHub



beliefer opened a new pull request, #50059:
URL: https://github.com/apache/spark/pull/50059

   ### What changes were proposed in this pull request?
   This PR proposes to unify the calling to the DataFrameReader API in Spark 
Connect where supports the jdbc API.
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   
   
   ### How was this patch tested?
   GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



LuciferYang commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967181996


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compres

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



yaooqinn commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967210507


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compressio

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



LuciferYang commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967217142


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compres

[PR] Add rpad and lpad support for PostgreDialect and MsSQLServerDialect [spark]

2025-02-24 Thread via GitHub



milosstojanovic opened a new pull request, #50060:
URL: https://github.com/apache/spark/pull/50060

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]

2025-02-24 Thread via GitHub



haiyangsun-db commented on code in PR #49560:
URL: https://github.com/apache/spark/pull/49560#discussion_r1967297789


##
sql/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -1031,6 +1031,26 @@ message GroupMap {
 
   // (Optional) The schema for the grouped state.
   optional DataType state_schema = 10;
+
+  // Below fields are used by TransformWithState and TransformWithStateInPandas
+  // (Optional) TransformWithState related parameters.
+  optional TransformWithStateInfo transform_with_state_info = 11;
+}
+
+// Additional input parameters used for TransformWithState operator.
+message TransformWithStateInfo {

Review Comment:
   `GroupMap` may also be used for a normal "flatMapGroups" without state - so 
ideally we should have just one optional field in "GroupMap" such as:
   ```
   oneof stateInfo {
   MapGroupsWithStateInfo = xx;
   TransformWithStateInfo = yy;
   }
   ```
   
   And the stateInfo would clearly indicate three different types:
   1. when it is none, it is a normal "flatMapGroups"
   2. when it is MapGroupsWithStateInfo, it's for "MapGroupsWithState" or 
"FlatMapGroupsWithState"
   3. otherwise, it's for transformWithState.
   
   This would look better, however, it is not backward compatible.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [MINOR][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]

2025-02-24 Thread via GitHub



yaooqinn opened a new pull request, #50061:
URL: https://github.com/apache/spark/pull/50061

   
   
   ### What changes were proposed in this pull request?
   
   This PR fixes test errors caused by improper DROP TABLE/VIEW in describe.sql
   
   - Table Not Found Error when dropping views after dropping the base table
   - Table Already Exists Error w/o proper drop
   
   
   
   ### Why are the changes needed?
   test fix
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   
   
   the existing tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add rpad and lpad support for PostgreDialect and MsSQLServerDialect [spark]

2025-02-24 Thread via GitHub



milosstojanovic closed pull request #50060: Add rpad and lpad support for 
PostgreDialect and MsSQLServerDialect
URL: https://github.com/apache/spark/pull/50060


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]

2025-02-24 Thread via GitHub



itholic closed pull request #50038: [SPARK-51278][PYTHON] Use appropriate 
structure of JSON format for `PySparkLogger`
URL: https://github.com/apache/spark/pull/50038


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]

2025-02-24 Thread via GitHub



itholic commented on PR #50038:
URL: https://github.com/apache/spark/pull/50038#issuecomment-2677920582

   Merged to master and branch-4.0.
   Thanks @ueshin @HyukjinKwon for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



beliefer commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967317270


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compressio

[PR] [SPARK-51304][DOCS][PYTHON] Use `getCondition` instead of `getErrorClass` in contribution guide [spark]

2025-02-24 Thread via GitHub



itholic opened a new pull request, #50062:
URL: https://github.com/apache/spark/pull/50062

   
   
   ### What changes were proposed in this pull request?
   
   This PR proposes to use `getCondition` instead of `getErrorClass` in 
contribution guide
   
   ### Why are the changes needed?
   
   We deprecated `getErrorClass`
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but the user-facing doc will be updated
   
   ### How was this patch tested?
   
   The existing doc build CI
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [DRAFT] Resolve default string producing expressions [spark]

2025-02-24 Thread via GitHub



stefankandic commented on code in PR #50053:
URL: https://github.com/apache/spark/pull/50053#discussion_r1967363581


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala:
##
@@ -28,6 +29,10 @@ import org.apache.spark.sql.types.{DataType, StringType}
  * collation from the corresponding object (table/view -> schema -> catalog).
  */
 object ResolveDDLCommandStringTypes extends Rule[LogicalPlan] {
+  // Tag to mark expressions that have been cast to a new type so that we can
+  // avoid infinite recursion when resolving the same expression multiple 
times.
+  private val CAST_ADDED_TAG = new 
TreeNodeTag[Unit]("defaultStringExpressionCastAdded")

Review Comment:
   Is there an alternative approach which doesn't lead to adding the cast 
infinitely?
   cc: @cloud-fan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon opened a new pull request, #50063:
URL: https://github.com/apache/spark/pull/50063

   ### What changes were proposed in this pull request?
   
   This PR is a followup of https://github.com/apache/spark/pull/48643 that 
updates _minimum_googleapis_common_protos_version in setup.py for pyspark-client
   
   ### Why are the changes needed?
   
   To match the version with pyspark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, `pyspark-client` has not been released yet.
   
   ### How was this patch tested?
   
   It will be tested in "Debug Build / Spark Connect Python-only (master, 
Python 3.11) " build.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon opened a new pull request, #50064:
URL: https://github.com/apache/spark/pull/50064

   ### What changes were proposed in this pull request?
   
   This PR is a followup of https://github.com/apache/spark/pull/48524 that 
updates _minimum_grpc_version in setup.py for pyspark-client
   
   ### Why are the changes needed?
   
   To match the version with pyspark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, `pyspark-client` has not been released yet.
   
   ### How was this patch tested?
   
   It will be tested in "Debug Build / Spark Connect Python-only (master, 
Python 3.11) " build.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon commented on PR #50063:
URL: https://github.com/apache/spark/pull/50063#issuecomment-2678115467

   Merged to master and branch-4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50063: [SPARK-50098][PYTHON][FOLLOW-UP] Update 
_minimum_googleapis_common_protos_version in setup.py for pyspark-client
URL: https://github.com/apache/spark/pull/50063


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon commented on PR #50064:
URL: https://github.com/apache/spark/pull/50064#issuecomment-2678117982

   Merged to master and brnach-4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50059:
URL: https://github.com/apache/spark/pull/50059#issuecomment-2678117446

   ping @HyukjinKwon @zhengruifeng @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]

2025-02-24 Thread via GitHub



yaooqinn opened a new pull request, #50065:
URL: https://github.com/apache/spark/pull/50065

   ### What changes were proposed in this pull request?
   
   This PR set isParsing to false for the timestamp formatter in DESCRIBE AS 
JSON, because the formatter is not used for parsing datetime strings
   
   ### Why are the changes needed?
   
   
   Although it does not affect the final output due to the current fmt we use 
now being w/o 'S' portion, it can prevent potential bugs if we store/display 
higher-precision timestamps. 
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   
   Existing tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]

2025-02-24 Thread via GitHub



HyukjinKwon closed pull request #50064: [SPARK-50015][PYTHON][FOLLOW-UP] Update 
_minimum_grpc_version in setup.py for pyspark-client
URL: https://github.com/apache/spark/pull/50064


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]

2025-02-24 Thread via GitHub



dusantism-db commented on code in PR #50026:
URL: https://github.com/apache/spark/pull/50026#discussion_r1967448405


##
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala:
##
@@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap(
   def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler
 }
 
+object TriggerToExceptionHandlerMap {
+  def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap(

Review Comment:
   This is the definition of a helper method which returns a new 
`TriggerToExceptionHandlerMap`. There is no state, we are using `def` not `val`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]

2025-02-24 Thread via GitHub



dusantism-db commented on code in PR #50026:
URL: https://github.com/apache/spark/pull/50026#discussion_r1967452308


##
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala:
##
@@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap(
   def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler
 }
 
+object TriggerToExceptionHandlerMap {
+  def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap(

Review Comment:
   Maybe it should have parenthesis for clarity - `empty()` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



yaooqinn commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967455912


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compressio

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967470537


##
core/benchmarks/ZStandardBenchmark-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool661  
  662   1  0.0   66080.2   1.0X
-Compression 1 times at level 2 without buffer pool701  
  702   1  0.0   70111.0   0.9X
-Compression 1 times at level 3 without buffer pool792  
  796   5  0.0   79224.8   0.8X
-Compression 1 times at level 1 with buffer pool   573  
  573   0  0.0   57276.4   1.2X
-Compression 1 times at level 2 with buffer pool   602  
  602   0  0.0   60206.9   1.1X
-Compression 1 times at level 3 with buffer pool   707  
  707   1  0.0   70665.0   0.9X
+Compression 1 times at level 1 without buffer pool266  
  267   2  0.0   26572.9   1.0X
+Compression 1 times at level 2 without buffer pool691  
  692   1  0.0   69069.2   0.4X
+Compression 1 times at level 3 without buffer pool792  
  802  10  0.0   79240.5   0.3X
+Compression 1 times at level 1 with buffer pool   572  
  573   1  0.0   57238.7   0.5X
+Compression 1 times at level 2 with buffer pool   604  
  605   2  0.0   60369.3   0.4X
+Compression 1 times at level 3 with buffer pool   735  
  738   2  0.0   73544.0   0.4X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool635  
  635   0  0.0   63471.5   1.0X
-Decompression 1 times from level 2 without buffer pool637  
  638   1  0.0   63693.2   1.0X
-Decompression 1 times from level 3 without buffer pool637  
  638   1  0.0   63687.9   1.0X
-Decompression 1 times from level 1 with buffer pool   545  
  545   0  0.0   54463.9   1.2X
-Decompression 1 times from level 2 with buffer pool   544  
  545   1  0.0   54405.3   1.2X
-Decompression 1 times from level 3 with buffer pool   544  
  545   1  0.0   54399.6   1.2X
+Decompression 1 times from level 1 without buffer pool595  
  595   0  0.0   59493.7   1.0X
+Decompression 1 times from level 2 without buffer pool596  
  596   1  0.0   59575.1   1.0X
+Decompression 1 times from level 3 without buffer pool596  
  597   1  0.0   59607.5   1.0X
+Decompression 1 times from level 1 with buffer pool   541  
  542   1  0.0   54117.6   1.1X
+Decompression 1 times from level 2 with buffer pool   541  
  542   1  0.0   54088.7   1.1X
+Decompression 1 times from level 3 with buffer pool   541  
  542   0  0.0   54106.0   1.1X
 
-OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compression

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



Madhukar525722 commented on PR #50022:
URL: https://github.com/apache/spark/pull/50022#issuecomment-2678157954

   HI @pan3793 , while testing we are facing a warning
   
   `[spark3-client]$ spark-sql --master yarn --deploy-mode client 
--driver-memory 4g --executor-memory 4g --conf 
spark.hadoop.hive.thrift.client.max.message.size=1gb`
   
   `Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   25/02/24 07:45:46 WARN HiveConf: HiveConf of name 
hive.thrift.client.max.message.size does not exist
   25/02/24 07:45:49 WARN HiveConf: HiveConf of name 
hive.thrift.client.max.message.size does not exist`
   
   We have already defined the conf in hive-site.xml
   `
 hive.thrift.client.max.message.size
 1gb
   `
   
   So, the error message for table persists.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



pan3793 commented on PR #50022:
URL: https://github.com/apache/spark/pull/50022#issuecomment-2678172344

   @Madhukar525722 so it works? 
   
   it's a warning message, not an error, and it seems reasonable to me. 
   
   > 07:45:49 WARN HiveConf: HiveConf of name 
hive.thrift.client.max.message.size does not exist
   
   BTW, use three backticks to quote the code blocks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



Madhukar525722 commented on PR #50022:
URL: https://github.com/apache/spark/pull/50022#issuecomment-2678181886

   It didnt worked @pan3793 , I am suspecting that the config setup didnt 
happend thats why it resulted in the same old behaviour. Apart from that other 
error logs are still same
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51305][CONNECT] Improve the code for createObservedMetricsResponse [spark]

2025-02-24 Thread via GitHub



beliefer opened a new pull request, #50066:
URL: https://github.com/apache/spark/pull/50066

   ### What changes were proposed in this pull request?
   This PR proposes to improve the code for `createObservedMetricsResponse`.
   
   
   ### Why are the changes needed?
   There exists a duplicate judgement in loop, we should eliminate it.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   
   
   ### How was this patch tested?
   GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   'No'.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1967508152


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")
+configureMaxThriftMessageSize(hiveConf, hive.getMSC, maxMessageSize)
+  }
+
+hive
+  }
+
+  private def getFieldValue[T](obj: Any, fieldName: String): T = {
+val field = obj.getClass.getDeclaredField(fieldName)
+field.setAccessible(true)
+field.get(obj).asInstanceOf[T]
+  }
+
+  private def findMethod(klass: Class[_], name: String, args: Class[_]*): 
Method = {
+val method = klass.getDeclaredMethod(name, args: _*)
+method.setAccessible(true)
+method
+  }
+
+  // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633
+  private def configureMaxThriftMessageSize(
+  hiveConf: HiveConf, msClient: IMetaStoreClient, maxMessageSize: Int): 
Unit = try {
+msClient match {
+  // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to 
support synchronization
+  // and retrying, we should unwrap and access the real MetaStoreClient 
instance firstly
+  case proxy if JdkProxy.isProxyClass(proxy.getClass) =>
+JdkProxy.getInvocationHandler(proxy) match {
+  case syncHandler if 
syncHandler.getClass.getName.endsWith("SynchronizedHandler") =>
+val realMsc = getFieldValue[IMetaStoreClient](syncHandler, 
"client")
+configureMaxThriftMessageSize(hiveConf, realMsc, maxMessageSize)
+  case retryHandler: RetryingMetaStoreClient =>
+val realMsc = getFieldValue[IMetaStoreClient](retryHandler, "base")
+configureMaxThriftMessageSize(hiveConf, realMsc, maxMessageSize)
+  case _ =>
+}
+  case msc: HiveMetaStoreClient if !msc.isLocalMetaStore =>
+msc.getTTransport match {
+  case t: TEndpointTransport =>
+val currentMaxMessageSize = t.getConfiguration.getMaxMessageSize
+if (currentMaxMessageSize != maxMessageSize) {
+  logDebug("Change the current metastore client thrift max message 
size from " +

Review Comment:
   @Madhukar525722 could you please try the latest patch, and monitor if this 
log is printed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1967508738


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")

Review Comment:
   @Madhukar525722 also this one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



beliefer commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967516710


##
core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool656  
  668  12  0.0   65591.6   1.0X
-Compression 1 times at level 2 without buffer pool709  
  711   2  0.0   70934.6   0.9X
-Compression 1 times at level 3 without buffer pool814  
  818   5  0.0   81370.9   0.8X
-Compression 1 times at level 1 with buffer pool   601  
  603   2  0.0   60100.1   1.1X
-Compression 1 times at level 2 with buffer pool   634  
  636   2  0.0   63449.9   1.0X
-Compression 1 times at level 3 with buffer pool   748  
  753   5  0.0   74789.7   0.9X
+Compression 1 times at level 1 without buffer pool650  
  666  11  0.0   65044.6   1.0X
+Compression 1 times at level 2 without buffer pool701  
  703   2  0.0   70071.2   0.9X
+Compression 1 times at level 3 without buffer pool804  
  807   2  0.0   80430.6   0.8X
+Compression 1 times at level 1 with buffer pool   598  
  599   1  0.0   59810.5   1.1X
+Compression 1 times at level 2 with buffer pool   631  
  640  17  0.0   63096.7   1.0X
+Compression 1 times at level 3 with buffer pool   751  
  753   4  0.0   75051.1   0.9X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool817  
  818   1  0.0   81723.2   1.0X
-Decompression 1 times from level 2 without buffer pool817  
  818   1  0.0   81729.4   1.0X
-Decompression 1 times from level 3 without buffer pool817  
  818   1  0.0   81719.7   1.0X
-Decompression 1 times from level 1 with buffer pool   749  
  757  14  0.0   74864.9   1.1X
-Decompression 1 times from level 2 with buffer pool   748  
  749   1  0.0   74789.1   1.1X
-Decompression 1 times from level 3 with buffer pool   748  
  749   1  0.0   74811.6   1.1X
+Decompression 1 times from level 1 without buffer pool823  
  831  14  0.0   82254.2   1.0X
+Decompression 1 times from level 2 without buffer pool824  
  827   5  0.0   82359.1   1.0X
+Decompression 1 times from level 3 without buffer pool819  
  821   2  0.0   81925.1   1.0X
+Decompression 1 times from level 1 with buffer pool   753  
  755   3  0.0   75284.1   1.1X
+Decompression 1 times from level 2 with buffer pool   753  
  755   1  0.0   75344.7   1.1X
+Decompression 1 times from level 3 with buffer pool   753  
  754   1  0.0   75263.0   1.1X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compressio

Re: [PR] [WIP][SQL] Add rpad and lpad support for PostgresDialect and MsSQLServerDialect expression pushdown [spark]

2025-02-24 Thread via GitHub



beliefer commented on PR #50060:
URL: https://github.com/apache/spark/pull/50060#issuecomment-2678235174

   Please create an issue for track down.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-51078][SPARK-50963][ML][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add back tests for default value [spark]

2025-02-24 Thread via GitHub



zhengruifeng opened a new pull request, #50067:
URL: https://github.com/apache/spark/pull/50067

   
   
   ### What changes were proposed in this pull request?
   add back tests deleted in 
https://github.com/apache/spark/commit/e0a7db2d2a7d295f933f9fc2d5605c5e59c58aa7#diff-50e109673576cc6d4f8727f54076b0884b93b759c49d98c908581db7093cb5ab
   
   
   ### Why are the changes needed?
   for coverage, `StopWordRemover`'s default value implementation is special
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



Madhukar525722 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1967534275


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")

Review Comment:
   @pan3793 , when I enabled the debug log and try to run, I can see
   ```25/02/24 12:11:30 DEBUG HiveClientImpl: Trying to set metastore client 
thrift max message to 1073741824```
   But the logs related to Change the current metastore client thrift max 
message size from, is not there



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



Madhukar525722 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1967534275


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")

Review Comment:
   @pan3793 , I have build using latest patch only.
   And when I enabled the debug log and try to run, I can see
   ```25/02/24 12:11:30 DEBUG HiveClientImpl: Trying to set metastore client 
thrift max message to 1073741824```
   But the logs related to Change the current metastore client thrift max 
message size from, is not there



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50022:
URL: https://github.com/apache/spark/pull/50022#discussion_r1967538536


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
   case _ =>
 new HiveConf(conf, classOf[HiveConf])
 }
-try {
+val hive = try {
   Hive.getWithoutRegisterFns(hiveConf)
 } catch {
   // SPARK-37069: not all Hive versions have the above method (e.g., Hive 
2.3.9 has it but
-  // 2.3.8 don't), therefore here we fallback when encountering the 
exception.
+  // 2.3.8 doesn't), therefore here we fallback when encountering the 
exception.
   case _: NoSuchMethodError =>
 Hive.get(hiveConf)
 }
+
+// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size 
when
+// `hive.thrift.client.max.message.size` is set and the value is positive
+Option(hiveConf.get("hive.thrift.client.max.message.size"))
+  .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+  .foreach { maxMessageSize =>
+logDebug(s"Trying to set metastore client thrift max message to 
$maxMessageSize")

Review Comment:
   @Madhukar525722 you may need to add some additional logs to debug what 
happened.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967556026


##
core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool656  
  668  12  0.0   65591.6   1.0X
-Compression 1 times at level 2 without buffer pool709  
  711   2  0.0   70934.6   0.9X
-Compression 1 times at level 3 without buffer pool814  
  818   5  0.0   81370.9   0.8X
-Compression 1 times at level 1 with buffer pool   601  
  603   2  0.0   60100.1   1.1X
-Compression 1 times at level 2 with buffer pool   634  
  636   2  0.0   63449.9   1.0X
-Compression 1 times at level 3 with buffer pool   748  
  753   5  0.0   74789.7   0.9X
+Compression 1 times at level 1 without buffer pool650  
  666  11  0.0   65044.6   1.0X
+Compression 1 times at level 2 without buffer pool701  
  703   2  0.0   70071.2   0.9X
+Compression 1 times at level 3 without buffer pool804  
  807   2  0.0   80430.6   0.8X
+Compression 1 times at level 1 with buffer pool   598  
  599   1  0.0   59810.5   1.1X
+Compression 1 times at level 2 with buffer pool   631  
  640  17  0.0   63096.7   1.0X
+Compression 1 times at level 3 with buffer pool   751  
  753   4  0.0   75051.1   0.9X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool817  
  818   1  0.0   81723.2   1.0X
-Decompression 1 times from level 2 without buffer pool817  
  818   1  0.0   81729.4   1.0X
-Decompression 1 times from level 3 without buffer pool817  
  818   1  0.0   81719.7   1.0X
-Decompression 1 times from level 1 with buffer pool   749  
  757  14  0.0   74864.9   1.1X
-Decompression 1 times from level 2 with buffer pool   748  
  749   1  0.0   74789.1   1.1X
-Decompression 1 times from level 3 with buffer pool   748  
  749   1  0.0   74811.6   1.1X
+Decompression 1 times from level 1 without buffer pool823  
  831  14  0.0   82254.2   1.0X
+Decompression 1 times from level 2 without buffer pool824  
  827   5  0.0   82359.1   1.0X
+Decompression 1 times from level 3 without buffer pool819  
  821   2  0.0   81925.1   1.0X
+Decompression 1 times from level 1 with buffer pool   753  
  755   3  0.0   75284.1   1.1X
+Decompression 1 times from level 2 with buffer pool   753  
  755   1  0.0   75344.7   1.1X
+Decompression 1 times from level 3 with buffer pool   753  
  754   1  0.0   75263.0   1.1X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compression

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



beliefer commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967569627


##
core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool656  
  668  12  0.0   65591.6   1.0X
-Compression 1 times at level 2 without buffer pool709  
  711   2  0.0   70934.6   0.9X
-Compression 1 times at level 3 without buffer pool814  
  818   5  0.0   81370.9   0.8X
-Compression 1 times at level 1 with buffer pool   601  
  603   2  0.0   60100.1   1.1X
-Compression 1 times at level 2 with buffer pool   634  
  636   2  0.0   63449.9   1.0X
-Compression 1 times at level 3 with buffer pool   748  
  753   5  0.0   74789.7   0.9X
+Compression 1 times at level 1 without buffer pool650  
  666  11  0.0   65044.6   1.0X
+Compression 1 times at level 2 without buffer pool701  
  703   2  0.0   70071.2   0.9X
+Compression 1 times at level 3 without buffer pool804  
  807   2  0.0   80430.6   0.8X
+Compression 1 times at level 1 with buffer pool   598  
  599   1  0.0   59810.5   1.1X
+Compression 1 times at level 2 with buffer pool   631  
  640  17  0.0   63096.7   1.0X
+Compression 1 times at level 3 with buffer pool   751  
  753   4  0.0   75051.1   0.9X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool817  
  818   1  0.0   81723.2   1.0X
-Decompression 1 times from level 2 without buffer pool817  
  818   1  0.0   81729.4   1.0X
-Decompression 1 times from level 3 without buffer pool817  
  818   1  0.0   81719.7   1.0X
-Decompression 1 times from level 1 with buffer pool   749  
  757  14  0.0   74864.9   1.1X
-Decompression 1 times from level 2 with buffer pool   748  
  749   1  0.0   74789.1   1.1X
-Decompression 1 times from level 3 with buffer pool   748  
  749   1  0.0   74811.6   1.1X
+Decompression 1 times from level 1 without buffer pool823  
  831  14  0.0   82254.2   1.0X
+Decompression 1 times from level 2 without buffer pool824  
  827   5  0.0   82359.1   1.0X
+Decompression 1 times from level 3 without buffer pool819  
  821   2  0.0   81925.1   1.0X
+Decompression 1 times from level 1 with buffer pool   753  
  755   3  0.0   75284.1   1.1X
+Decompression 1 times from level 2 with buffer pool   753  
  755   1  0.0   75344.7   1.1X
+Decompression 1 times from level 3 with buffer pool   753  
  754   1  0.0   75263.0   1.1X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compressio

Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]

2025-02-24 Thread via GitHub



pan3793 commented on code in PR #50057:
URL: https://github.com/apache/spark/pull/50057#discussion_r1967578779


##
core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
##
@@ -2,48 +2,48 @@
 Benchmark ZStandardCompressionCodec
 

 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Compression 1 times at level 1 without buffer pool656  
  668  12  0.0   65591.6   1.0X
-Compression 1 times at level 2 without buffer pool709  
  711   2  0.0   70934.6   0.9X
-Compression 1 times at level 3 without buffer pool814  
  818   5  0.0   81370.9   0.8X
-Compression 1 times at level 1 with buffer pool   601  
  603   2  0.0   60100.1   1.1X
-Compression 1 times at level 2 with buffer pool   634  
  636   2  0.0   63449.9   1.0X
-Compression 1 times at level 3 with buffer pool   748  
  753   5  0.0   74789.7   0.9X
+Compression 1 times at level 1 without buffer pool650  
  666  11  0.0   65044.6   1.0X
+Compression 1 times at level 2 without buffer pool701  
  703   2  0.0   70071.2   0.9X
+Compression 1 times at level 3 without buffer pool804  
  807   2  0.0   80430.6   0.8X
+Compression 1 times at level 1 with buffer pool   598  
  599   1  0.0   59810.5   1.1X
+Compression 1 times at level 2 with buffer pool   631  
  640  17  0.0   63096.7   1.0X
+Compression 1 times at level 3 with buffer pool   751  
  753   4  0.0   75051.1   0.9X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Benchmark ZStandardCompressionCodec:Best Time(ms)   
Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-Decompression 1 times from level 1 without buffer pool817  
  818   1  0.0   81723.2   1.0X
-Decompression 1 times from level 2 without buffer pool817  
  818   1  0.0   81729.4   1.0X
-Decompression 1 times from level 3 without buffer pool817  
  818   1  0.0   81719.7   1.0X
-Decompression 1 times from level 1 with buffer pool   749  
  757  14  0.0   74864.9   1.1X
-Decompression 1 times from level 2 with buffer pool   748  
  749   1  0.0   74789.1   1.1X
-Decompression 1 times from level 3 with buffer pool   748  
  749   1  0.0   74811.6   1.1X
+Decompression 1 times from level 1 without buffer pool823  
  831  14  0.0   82254.2   1.0X
+Decompression 1 times from level 2 without buffer pool824  
  827   5  0.0   82359.1   1.0X
+Decompression 1 times from level 3 without buffer pool819  
  821   2  0.0   81925.1   1.0X
+Decompression 1 times from level 1 with buffer pool   753  
  755   3  0.0   75284.1   1.1X
+Decompression 1 times from level 2 with buffer pool   753  
  755   1  0.0   75344.7   1.1X
+Decompression 1 times from level 3 with buffer pool   753  
  754   1  0.0   75263.0   1.1X
 
-OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure
+OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 Parallel Compression at level 3:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-Parallel Compression

Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]

2025-02-24 Thread via GitHub



dusantism-db commented on code in PR #50027:
URL: https://github.com/apache/spark/pull/50027#discussion_r1967595134


##
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala:
##
@@ -599,6 +599,116 @@ class CaseStatementExec(
   }
 }
 
+/**
+ * Executable node for SimpleCaseStatement.
+ * @param caseVariableExec Statement with which all conditionExpressions will 
be compared to.
+ * @param conditionExpressions Collection of expressions which correspond to 
WHEN clauses.
+ * @param conditionalBodies Collection of executable bodies that have a 
corresponding condition,
+ * in WHEN branches.
+ * @param elseBody Body that is executed if none of the conditions are met, 
i.e. ELSE branch.
+ * @param session Spark session that SQL script is executed within.
+ * @param context SqlScriptingExecutionContext keeps the execution state of 
current script.
+ */
+class SimpleCaseStatementExec(
+caseVariableExec: SingleStatementExec,
+conditionExpressions: Seq[Expression],
+conditionalBodies: Seq[CompoundBodyExec],
+elseBody: Option[CompoundBodyExec],
+session: SparkSession,
+context: SqlScriptingExecutionContext) extends NonLeafStatementExec {
+  private object CaseState extends Enumeration {
+val Condition, Body = Value
+  }
+
+  private var state = CaseState.Condition
+  var bodyExec: Option[CompoundBodyExec] = None
+
+  var conditionBodyTupleIterator: Iterator[(SingleStatementExec, 
CompoundBodyExec)] = _
+  private var caseVariableLiteral: Literal = _
+
+  private var isCacheValid = false
+  private def validateCache(): Unit = {
+if (!isCacheValid) {
+  val values = caseVariableExec.buildDataFrame(session).collect()
+  caseVariableExec.isExecuted = true
+
+  caseVariableLiteral = Literal(values.head.get(0))
+  conditionBodyTupleIterator = createConditionBodyIterator
+  isCacheValid = true
+}
+  }
+
+  private def cachedCaseVariableLiteral: Literal = {
+validateCache()
+caseVariableLiteral
+  }
+
+  private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, 
CompoundBodyExec)] = {
+validateCache()
+conditionBodyTupleIterator
+  }
+
+  private lazy val treeIterator: Iterator[CompoundStatementExec] =
+new Iterator[CompoundStatementExec] {
+  override def hasNext: Boolean = state match {
+case CaseState.Condition => cachedConditionBodyIterator.hasNext || 
elseBody.isDefined
+case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext)
+  }
+
+  override def next(): CompoundStatementExec = state match {
+case CaseState.Condition =>
+  cachedConditionBodyIterator.nextOption()
+.map { case (condStmt, body) =>
+  if (evaluateBooleanCondition(session, condStmt)) {
+bodyExec = Some(body)
+state = CaseState.Body
+  }
+  condStmt
+}
+.orElse(elseBody.map { body => {
+  bodyExec = Some(body)
+  state = CaseState.Body
+  next()
+}})
+.get
+case CaseState.Body => bodyExec.get.getTreeIterator.next()
+  }
+}
+
+  private def createConditionBodyIterator: Iterator[(SingleStatementExec, 
CompoundBodyExec)] =
+conditionExpressions.zip(conditionalBodies)
+  .iterator
+  .map { case (expr, body) =>
+val condition = Project(
+  Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()),
+  OneRowRelation()
+)
+// We hack the Origin to provide more descriptive error messages. For 
example, if
+// the case variable is 1 and the condition expression it's compared 
to is 5, we
+// will get Origin with text "(1 = 5)".
+val conditionText = 
condition.projectList.head.asInstanceOf[Alias].child.toString
+val condStmt = new SingleStatementExec(
+  condition,
+  Origin(sqlText = Some(conditionText),
+startIndex = Some(0),
+stopIndex = Some(conditionText.length - 1)),

Review Comment:
   I added the line propagation - 
https://github.com/apache/spark/pull/50027/commits/d06ab7310a1dd2fd1084943194d78fe3900845d6



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-51290][SQL] Enable filling default values in DSv2 writes [spark]

2025-02-24 Thread via GitHub



cloud-fan commented on code in PR #50044:
URL: https://github.com/apache/spark/pull/50044#discussion_r1967618587


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3534,7 +3534,8 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 TableOutputResolver.suitableForByNameCheck(v2Write.isByName,
   expected = v2Write.table.output, queryOutput = v2Write.query.output)
 val projection = TableOutputResolver.resolveOutputColumns(
-  v2Write.table.name, v2Write.table.output, v2Write.query, 
v2Write.isByName, conf)
+  v2Write.table.name, v2Write.table.output, v2Write.query, 
v2Write.isByName, conf,
+  supportColDefaultValue = true)

Review Comment:
   Yea true, Spark fills the default values during table writing and it works 
for all catalogs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 149 matches

Mail list logo