Re: [PR] [SPARK-51305][SQL][CONNECT] Improve `SparkConnectPlanExecution.createObservedMetricsResponse` [spark]
beliefer commented on PR #50066: URL: https://github.com/apache/spark/pull/50066#issuecomment-2680591378 @dongjoon-hyun Thank you!
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
cloud-fan commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1968959548 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala: ## @@ -841,20 +841,24 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo } test("SPARK-51281: create/replace file source tables") { +def checkResults(df: DataFrame): Unit = { + checkAnswer(df, spark.range(10).toDF()) +} Review Comment: what do you mean?
Re: [PR] [SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker [spark]
HeartSaVioR closed pull request #50015: [SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker URL: https://github.com/apache/spark/pull/50015
Re: [PR] [SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker [spark]
HeartSaVioR commented on PR #50015: URL: https://github.com/apache/spark/pull/50015#issuecomment-2680891589 Closing via #50037 - a much simpler change; both PRs do not address the original report, which @cloud-fan will address later.
Re: [PR] [SPARK-51187][SQL][SS] Implement the graceful deprecation of incorrect config introduced in SPARK-49699 [spark]
HeartSaVioR commented on PR #49983: URL: https://github.com/apache/spark/pull/49983#issuecomment-2680925857 @cloud-fan > have we merged this graceful deprecation in branch 3.5? Yes, that is merged. It's still a blocker for Spark 4.0.0 though. @dongjoon-hyun > If you don't mind, I'd like to propose to wait until we finish Apache Spark 3.5.5 release. Would you mind explaining why this has to be coupled with Spark 3.5.5, as the discussion for 4.1/4.0 can always happen in parallel? I sincerely disagree that option 2 is viable for us. It places a huge limitation on SS users upgrading from Spark 3.5.4 (they have to go through Spark 3.5.x before upgrading to 4.0.0), just to achieve our goal of removing the "string" of the incorrect config immediately. Again, this is just a string because it's only used by the migration logic and users won't be able to set the config manually. Why not be more open and give Spark 3.5.4 users a range of upgrade paths? If you'd like to hear more voices, I'm happy to post the discussion to dev@. I just don't want this change to be coupled with an unrelated release.
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1969188058 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala: ## @@ -1407,13 +1410,83 @@ private[hive] object HiveClientImpl extends Logging { case _ => new HiveConf(conf, classOf[HiveConf]) } -try { +val hive = try { Hive.getWithoutRegisterFns(hiveConf) } catch { // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but - // 2.3.8 don't), therefore here we fallback when encountering the exception. + // 2.3.8 doesn't), therefore here we fallback when encountering the exception. case _: NoSuchMethodError => Hive.get(hiveConf) } + +// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size when +// `hive.thrift.client.max.message.size` is set and the value is positive +Option(hiveConf.get("hive.thrift.client.max.message.size")) + .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0) + .foreach { maxMessageSize => +logDebug(s"Trying to set metastore client thrift max message to $maxMessageSize") +configureMaxThriftMessageSize(hiveConf, hive.getMSC, maxMessageSize) + } + +hive + } + + private def getFieldValue[T](obj: Any, fieldName: String): T = { +val field = obj.getClass.getDeclaredField(fieldName) +field.setAccessible(true) +field.get(obj).asInstanceOf[T] + } + + private def getFieldValue[T](obj: Any, clazz: Class[_], fieldName: String): T = { +val field = clazz.getDeclaredField(fieldName) +field.setAccessible(true) +field.get(obj).asInstanceOf[T] + } + + // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633 + private def configureMaxThriftMessageSize( + hiveConf: HiveConf, msClient: IMetaStoreClient, maxMessageSize: Int): Unit = try { +msClient match { + // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to support synchronization + // and retrying, we should unwrap and access the underlying MetaStoreClient instance firstly + case proxy if JdkProxy.isProxyClass(proxy.getClass) => +JdkProxy.getInvocationHandler(proxy) match { + case syncHandler if syncHandler.getClass.getName.endsWith("SynchronizedHandler") => +val wrappedMsc = getFieldValue[IMetaStoreClient](syncHandler, "client") +configureMaxThriftMessageSize(hiveConf, wrappedMsc, maxMessageSize) + case retryHandler: RetryingMetaStoreClient => +val wrappedMsc = getFieldValue[IMetaStoreClient](retryHandler, "base") +configureMaxThriftMessageSize(hiveConf, wrappedMsc, maxMessageSize) + case _ => +} + case msc: HiveMetaStoreClient if !msc.isLocalMetaStore => +@tailrec +def configure(t: TTransport): Unit = t match { + // Unwrap and access the underlying TTransport when security enabled (Kerberos) + case tTransport: TFilterTransport => Review Comment: @Madhukar525722 Kerberos cases are addressed here
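As a rough usage sketch only (this is not taken from the PR; it assumes the usual `spark.hadoop.*` propagation into the metastore client's `HiveConf`), a user hitting an oversized thrift response could raise the cap like this:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the config name follows HIVE-26633; per the PR, the cap is applied
// only when the value is set and positive, and size suffixes such as "1gb" are accepted.
val spark = SparkSession.builder()
  .appName("hms-max-message-size-example")
  .enableHiveSupport()
  .config("spark.hadoop.hive.thrift.client.max.message.size", "1gb")
  .getOrCreate()

spark.sql("SHOW TABLES").show()
```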
Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]
hvanhovell commented on code in PR #49560: URL: https://github.com/apache/spark/pull/49560#discussion_r1968788349 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1034,6 +1038,49 @@ class SparkConnectPlanner( .logicalPlan } + private def transformTransformWithStateInPandas( + pythonUdf: PythonUDF, + groupedDs: RelationalGroupedDataset, + rel: proto.GroupMap): LogicalPlan = { +val twsInfo = rel.getTransformWithStateInfo +val outputSchema = parseSchema(twsInfo.getOutputSchema) + +if (rel.hasInitialInput) { + val initialGroupingCols = rel.getInitialGroupingExpressionsList.asScala.toSeq.map(expr => +Column(transformExpression(expr))) + + val initialStateDs = Dataset +.ofRows(session, transformRelation(rel.getInitialInput)) Review Comment: I am aware it is used in multiple places in the planner. That does not mean it should be encouraged. We want to get rid of most of this after Spark 4.0.
Re: [PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]
wayneguow commented on PR #50073: URL: https://github.com/apache/spark/pull/50073#issuecomment-2680420915 cc @MaxGekk @allisonwang-db
Re: [PR] [SPARK-50692][SQL][FOLLOWUP] Add the LPAD and RPAD pushdown support for H2 [spark]
beliefer commented on PR #50068: URL: https://github.com/apache/spark/pull/50068#issuecomment-2680314901 > Oh, did you aim to use this as a follow-up, @beliefer ? Uh, I forgot it. I want it to be a follow-up.
[PR] [SPARK-51308][CONNECT][BUILD] Update the relocation rules for the `connect` module in `SparkBuild.scala` to ensure that both Maven and SBT produce the assembly JAR according to the same rules [spark]
wayneguow opened a new pull request, #50075: URL: https://github.com/apache/spark/pull/50075 ### What changes were proposed in this pull request? This PR aims to update the relocation rules for the `connect` module in `SparkBuild.scala`. ### Why are the changes needed? Ensure that both Maven and SBT produce the assembly JAR according to the same rules. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No.
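For readers unfamiliar with the SBT side, relocation in `SparkBuild.scala` is expressed with sbt-assembly shade rules along these lines (a hedged illustration only; the packages and target prefixes below are placeholders, not the rule set this PR actually changes):

```scala
// build.sbt-style sketch (hypothetical packages): the key point of the PR is that these
// SBT rules must stay in sync with the maven-shade-plugin relocations so that Maven and
// SBT assemble the connect JAR identically.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.connect.protobuf.@1").inAll,
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
)
```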
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
dongjoon-hyun commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1969044495 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5545,6 +5545,15 @@ object SQLConf { .booleanConf .createWithDefault(false) + val LEGACY_DF_WRITER_V2_IGNORE_PATH_OPTION = +buildConf("spark.sql.legacy.dataFrameWriterV2IgnorePathOption") + .internal() + .doc("When set to true, DataFrameWriterV2 ignores the 'path' option and always write data " + +"to the default table location.") + .version("3.5.5") Review Comment: > I'll change it to 3.5.6 after the 3.5.5 RC1 vote passes. Ya, let's wait and see the vote result because it's still open.
Re: [PR] [SPARK-51281][SQL] DataFrameWriterV2 should respect the path option [spark]
szehon-ho commented on code in PR #50040: URL: https://github.com/apache/spark/pull/50040#discussion_r1969046534 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala: ## @@ -839,4 +839,30 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo condition = "CALL_ON_STREAMING_DATASET_UNSUPPORTED", parameters = Map("methodName" -> "`writeTo`")) } + + test("SPARK-51281: create/replace file source tables") { Review Comment: question: are we testing replace here? Why not just the name of the jira?
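For context on what the test exercises, here is a hedged sketch of the behavior under discussion (table name and path are made up; a running `spark` session, e.g. spark-shell, is assumed):

```scala
// With the fix, DataFrameWriterV2 honors the explicit 'path' option for file source
// tables instead of silently writing to the default table location.
spark.range(10).toDF()
  .writeTo("t1")
  .using("parquet")
  .option("path", "/tmp/spark/custom_location")
  .create()

// The old behavior (path option ignored) can be restored via the legacy flag added
// elsewhere in this PR:
// spark.conf.set("spark.sql.legacy.dataFrameWriterV2IgnorePathOption", "true")
```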
Re: [PR] [SPARK-51309][BUILD] Upgrade rocksdbjni to 9.10.0 [spark]
wayneguow commented on PR #50076: URL: https://github.com/apache/spark/pull/50076#issuecomment-2680812582 Related benchmark results: - jdk17: https://github.com/wayneguow/spark/actions/runs/13513028574 - jdk21: https://github.com/wayneguow/spark/actions/runs/13513032754
[PR] [SPARK-51309][BUILD] Upgrade rocksdbjni to 9.10.0 [spark]
wayneguow opened a new pull request, #50076: URL: https://github.com/apache/spark/pull/50076 ### What changes were proposed in this pull request? This PR aims to upgrade `rocksdbjni` from 9.8.4 to 9.10.0. ### Why are the changes needed? There are some bug fixes and performance improvements; full release notes: - https://github.com/facebook/rocksdb/releases/tag/v9.9.3 - https://github.com/facebook/rocksdb/releases/tag/v9.10.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
micheal-o commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968977014 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -419,6 +432,10 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with private val loadedMapCacheHitCount: LongAdder = new LongAdder private val loadedMapCacheMissCount: LongAdder = new LongAdder + // This is updated when the maintenance task writes the snapshot file and read by the task Review Comment: it is possible for task thread to do snapshot upload too right?
Re: [PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]
LuciferYang commented on PR #50073: URL: https://github.com/apache/spark/pull/50073#issuecomment-2680871903 also cc @cloud-fan
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
liviazhu-db commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968325595 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with private val loadedMapCacheHitCount: LongAdder = new LongAdder private val loadedMapCacheMissCount: LongAdder = new LongAdder + // This is updated when the maintenance task writes the snapshot file, -1 represents no version + // has ever been uploaded. + private var lastSnapshotUploadedVersion: Long = -1L Review Comment: Does this not need to be an AtomicLong, since maintenance thread will modify it and task thread will read it?
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
liviazhu-db commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968325595 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with private val loadedMapCacheHitCount: LongAdder = new LongAdder private val loadedMapCacheMissCount: LongAdder = new LongAdder + // This is updated when the maintenance task writes the snapshot file, -1 represents no version + // has ever been uploaded. + private var lastSnapshotUploadedVersion: Long = -1L Review Comment: Doesn't this need to be an AtomicLong, since maintenance thread will modify it and task thread will read it? Can you add this in a comment as well?
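For readers following the thread, a minimal sketch (not the PR's actual code) of the concern: a plain `var` written by the maintenance thread and read by the task thread has no visibility guarantee, so such a field is typically made `@volatile` or backed by an `AtomicLong`, for example:

```scala
import java.util.concurrent.atomic.AtomicLong

class SnapshotUploadTracker {
  // -1 means no snapshot version has ever been uploaded.
  private val lastSnapshotUploadedVersion = new AtomicLong(-1L)

  // Called from the maintenance thread after a snapshot file is written;
  // the value only ever moves forward.
  def recordUpload(version: Long): Unit =
    lastSnapshotUploadedVersion.getAndUpdate(current => math.max(current, version))

  // Called from the task thread when reporting instance metrics.
  def lastUploaded: Long = lastSnapshotUploadedVersion.get()
}
```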
Re: [PR] [SPARK-51290][SQL] Enable filling default values in DSv2 writes [spark]
viirya commented on code in PR #50044: URL: https://github.com/apache/spark/pull/50044#discussion_r1968152464 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala: ## @@ -718,6 +724,11 @@ private class BufferedRowsReader( schema: StructType, row: InternalRow): Any = { val index = schema.fieldIndex(field.name) + +if (index >= row.numFields) { Review Comment: This is method `extractFieldValue`. It looks like it is only used by `get`. Why is this for adding columns?
Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]
jingz-db commented on code in PR #49560: URL: https://github.com/apache/spark/pull/49560#discussion_r1968245505 ## sql/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -1031,6 +1031,26 @@ message GroupMap { // (Optional) The schema for the grouped state. optional DataType state_schema = 10; + + // Below fields are used by TransformWithState and TransformWithStateInPandas + // (Optional) TransformWithState related parameters. + optional TransformWithStateInfo transform_with_state_info = 11; +} + +// Additional input parameters used for TransformWithState operator. +message TransformWithStateInfo { Review Comment: > This would look better, however, it is not backward compatible. Yeah agreed... I guess it is not very safe to change the connect protocol for a released operator. We should probably leave FMGWS as it is in this PR.
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
Madhukar525722 commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2679438548 Hi @pan3793. The flow was able to reach the `case msc:` branch, but there I added a debug log: ``` msc.getTTransport match { case t: TEndpointTransport => val currentMaxMessageSize = t.getConfiguration.getMaxMessageSize ... case _ => logDebug(s"The metastore client transport is not TEndpointTransport, but: ${msc.getTTransport.getClass.getName}") } ``` and the logged value was: The metastore client transport is not TEndpointTransport, but: org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport. Does it require more unwrapping before getting to the actual TEndpointTransport?
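For context, a minimal sketch of the kind of unwrapping being discussed (an illustration, not the PR's exact code; it assumes the wrapper keeps the inner transport in a field named `wrapped`, as Hive's `TFilterTransport` subclasses such as `TUGIAssumingTransport` do):

```scala
import java.lang.reflect.Field
import scala.annotation.tailrec
import org.apache.thrift.transport.TTransport

object TransportUnwrap {
  // Walk up the class hierarchy to find a declared field by name.
  private def findField(cls: Class[_], name: String): Option[Field] =
    if (cls == null) None
    else cls.getDeclaredFields.find(_.getName == name)
      .orElse(findField(cls.getSuperclass, name))

  // Recursively peel Kerberos wrapper transports (e.g. TUGIAssumingTransport, which
  // extends Hive's TFilterTransport) until reaching the underlying transport that
  // owns the thrift configuration. Assumes the inner transport is held in a field
  // named "wrapped" -- verify against the actual Hive version before relying on this.
  @tailrec
  def unwrapTransport(t: TTransport): TTransport =
    findField(t.getClass, "wrapped") match {
      case Some(f) if classOf[TTransport].isAssignableFrom(f.getType) =>
        f.setAccessible(true)
        unwrapTransport(f.get(t).asInstanceOf[TTransport])
      case _ => t
    }
}
```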
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
liviazhu-db commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968396281 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with supportedCustomMetrics.find(_.name == name).map(_ -> value) } + (metricStateOnCurrentVersionSizeBytes -> SizeEstimator.estimate(mapToUpdate)) - StateStoreMetrics(mapToUpdate.size(), metricsFromProvider("memoryUsedBytes"), customMetrics) + val instanceMetrics = Map( +instanceMetricSnapshotLastUpload.withNewId( Review Comment: Oh I see, makes sense
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
liviazhu-db commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968332195 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with supportedCustomMetrics.find(_.name == name).map(_ -> value) } + (metricStateOnCurrentVersionSizeBytes -> SizeEstimator.estimate(mapToUpdate)) - StateStoreMetrics(mapToUpdate.size(), metricsFromProvider("memoryUsedBytes"), customMetrics) + val instanceMetrics = Map( +instanceMetricSnapshotLastUpload.withNewId( Review Comment: Why do we do withNewId here and not assign these values when we create `instanceMetricSnapshotLastUpload` on line 452?
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
zecookiez commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968378429 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -219,7 +219,18 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with supportedCustomMetrics.find(_.name == name).map(_ -> value) } + (metricStateOnCurrentVersionSizeBytes -> SizeEstimator.estimate(mapToUpdate)) - StateStoreMetrics(mapToUpdate.size(), metricsFromProvider("memoryUsedBytes"), customMetrics) + val instanceMetrics = Map( +instanceMetricSnapshotLastUpload.withNewId( Review Comment: Good question, it looks like the main issue is that the state store ID is initialized at a later point in time than the metrics, and the default value is null
Re: [PR] [SPARK-51252] [SS] Add instance metrics for last uploaded snapshot version in HDFS State Stores [spark]
zecookiez commented on code in PR #50030: URL: https://github.com/apache/spark/pull/50030#discussion_r1968381328 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala: ## @@ -419,6 +433,10 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with private val loadedMapCacheHitCount: LongAdder = new LongAdder private val loadedMapCacheMissCount: LongAdder = new LongAdder + // This is updated when the maintenance task writes the snapshot file, -1 represents no version + // has ever been uploaded. + private var lastSnapshotUploadedVersion: Long = -1L Review Comment: Good catch, I initially thought this wasn't needed for HDFS but I misread how we're using `writeSnapshot`. Updated with comment!
Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]
HyukjinKwon commented on PR #50058: URL: https://github.com/apache/spark/pull/50058#issuecomment-2679997693 cc @dongjoon-hyun This should fix the scheduled build and make it green 👍
Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]
dongjoon-hyun commented on PR #50058: URL: https://github.com/apache/spark/pull/50058#issuecomment-268537 Thank you, @HyukjinKwon !
[PR] [SPARK-51307][SQL] locationUri in CatalogStorageFormat shall be decoded for display [spark]
yaooqinn opened a new pull request, #50074: URL: https://github.com/apache/spark/pull/50074 ### What changes were proposed in this pull request? This PR uses CatalogUtils.URIToString instead of URI.toString to decode the location URI. ### Why are the changes needed? For example, for partition specs like test1=X'16', test3=timestamp'2018-11-17 13:33:33', the stored path will include them as `test1=%16/test3=2018-11-17 13%3A33%3A33` because the special characters are escaped. Furthermore, while resolving the whole path string to a URI object, this path fragment becomes `test1=%2516/test3=2018-11-17 13%253A33%253A33`, so we need to decode `%25` -> `%` before displaying it to users. ### Does this PR introduce _any_ user-facing change? Yes, DESC TABLE will no longer show double-encoded paths. ### How was this patch tested? New tests ### Was this patch authored or co-authored using generative AI tooling? No
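To make the double-encoding concrete, here is a small standalone sketch (illustrative only, not the PR's code) using `java.net.URI`, whose multi-argument constructors always quote the percent character:

```scala
import java.net.URI

object DoubleEncodingDemo extends App {
  // A partition path fragment as stored on the filesystem: special characters
  // in the partition values are already escaped once.
  val storedPath = "/warehouse/t/test1=%16/test3=2018-11-17 13%3A33%3A33"

  // Turning the string into a URI re-quotes '%' (and ' '), producing the
  // double-encoded form, e.g. test1=%2516 and 13%253A33%253A33.
  val uri = new URI("file", null, storedPath, null)
  println(uri.toString)

  // getPath decodes one level of escaping, which is the form that should be
  // shown to users; it prints the storedPath form again.
  println(uri.getPath)
}
```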
Re: [PR] [SPARK-51304][DOCS][PYTHON] Use `getCondition` instead of `getErrorClass` in contribution guide [spark]
itholic commented on PR #50062: URL: https://github.com/apache/spark/pull/50062#issuecomment-2680085229 Thanks all for the review!
[PR] [SPARK-51289][SQL] Throw a proper error message for not fully implemented `SQLTableFunction` [spark]
wayneguow opened a new pull request, #50073: URL: https://github.com/apache/spark/pull/50073 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968733126 ## mllib/src/main/scala/org/apache/spark/ml/classification/FMClassifier.scala: ## @@ -235,6 +236,13 @@ class FMClassifier @Since("3.0.0") ( model.setSummary(Some(summary)) } + override def estimateModelSize(dataset: Dataset[_]): Long = { +val numFeatures = DatasetUtils.getNumFeatures(dataset, $(featuresCol)) Review Comment: I am worried that executing the input query multiple times will be considered wasteful. I am wondering if we should do the check while fitting instead; we fail as soon as the model gets too large.
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968734181 ## mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala: ## @@ -1248,6 +1263,11 @@ class LogisticRegressionModel private[spark] ( } } + private[spark] override def estimatedSize: Long = { +SizeEstimator.estimate((this.params, this.uid, Review Comment: NIT: Can we just make the SizeEstimator take varargs instead of creating a tuple here.
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968738761 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -780,6 +780,11 @@ "Cannot retrieve from the ML cache. It is probably because the entry has been evicted." ] }, + "CACHE_ITEM_EXCEEDED" : { +"message" : [ + "Cannot object , since its estimated size exceeds the limitation ." Review Comment: `...exceeds the maximum allowed size .` instead of `exceeds the limitation .`? Does this use proper size formats?
Re: [PR] [DRAFT] Two string types [spark]
github-actions[bot] closed pull request #48861: [DRAFT] Two string types URL: https://github.com/apache/spark/pull/48861
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968708896 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLCache.scala: ## @@ -21,23 +21,52 @@ import java.util.concurrent.{ConcurrentMap, TimeUnit} import com.google.common.cache.CacheBuilder -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, LogKeys, MDC} +import org.apache.spark.ml.Model import org.apache.spark.ml.util.ConnectHelper +import org.apache.spark.sql.classic.SparkSession +import org.apache.spark.sql.connect.config.Connect._ +import org.apache.spark.util.SizeEstimator /** * MLCache is for caching ML objects, typically for models and summaries evaluated by a model. */ -private[connect] class MLCache extends Logging { +private[connect] class MLCache(session: SparkSession) extends Logging { private val helper = new ConnectHelper() private val helperID = "__ML_CONNECT_HELPER__" + private def conf = session.sessionState.conf - private val cachedModel: ConcurrentMap[String, Object] = CacheBuilder -.newBuilder() -.softValues() -.maximumSize(MLCache.MAX_CACHED_ITEMS) -.expireAfterAccess(MLCache.CACHE_TIMEOUT_MINUTE, TimeUnit.MINUTES) -.build[String, Object]() -.asMap() + private val cachedModel: ConcurrentMap[String, (Object, Long)] = { +val builder = CacheBuilder.newBuilder().softValues() + +val cacheWeight = conf.getConf(CONNECT_SESSION_ML_CACHE_TOTAL_ITEM_SIZE) +val cacheSize = conf.getConf(CONNECT_SESSION_ML_CACHE_SIZE) +val timeOut = conf.getConf(CONNECT_SESSION_ML_CACHE_TIMEOUT) + +if (cacheWeight > 0) { + builder Review Comment: What is the eviction policy here? LRU?
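On the eviction question: Guava's weight-based caches evict entries that have not been used recently (approximately LRU) once `maximumWeight` is exceeded, and `expireAfterAccess` additionally drops idle entries. A minimal sketch of a size-weighted cache in the same spirit as this diff (illustrative only, not the PR's exact code):

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.{CacheBuilder, Weigher}

object MlCacheSketch {
  // Value stored per cached object: the object itself plus its estimated size in bytes.
  case class Entry(obj: AnyRef, estimatedSizeBytes: Long)

  def build(maxTotalBytes: Long, timeoutMinutes: Long) =
    CacheBuilder.newBuilder()
      .softValues()
      .maximumWeight(maxTotalBytes)
      .weigher[String, Entry](new Weigher[String, Entry] {
        // Guava weights are Ints, so cap very large estimates at Int.MaxValue.
        override def weigh(key: String, value: Entry): Int =
          math.min(value.estimatedSizeBytes, Int.MaxValue.toLong).toInt
      })
      .expireAfterAccess(timeoutMinutes, TimeUnit.MINUTES)
      .build[String, Entry]()
}
```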
Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]
yaooqinn commented on PR #50065: URL: https://github.com/apache/spark/pull/50065#issuecomment-2680188025 Merged to master/4.0, thank you @dongjoon-hyun @asl3 @HyukjinKwon
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968704565 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala: ## @@ -125,6 +127,15 @@ private[connect] object MLHandler extends Logging { val dataset = MLUtils.parseRelationProto(fitCmd.getDataset, sessionHolder) val estimator = MLUtils.getEstimator(sessionHolder, estimatorProto, Some(fitCmd.getParams)) + +// pre-training model size check +val maxSize = conf.getConf(Connect.CONNECT_SESSION_ML_CACHE_SINGLE_ITEM_SIZE) +if (maxSize > 0) { + val estimatedSize = estimator.estimateModelSize(dataset) Review Comment: How accurate is this? How much state is added to the model while fitting?
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968705260 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLException.scala: ## @@ -36,3 +36,17 @@ private[spark] case class MLCacheInvalidException(objectName: String) errorClass = "CONNECT_ML.CACHE_INVALID", messageParameters = Map("objectName" -> objectName), cause = null) + +private[spark] case class MlItemSizeExceededException( Review Comment: Let's make sure the client side error mappings can deal with this new Exception. cc @itholic .
Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]
dongjoon-hyun commented on PR #50065: URL: https://github.com/apache/spark/pull/50065#issuecomment-2680207829 Thank you, @yaooqinn and all.
Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]
dongjoon-hyun commented on PR #50061: URL: https://github.com/apache/spark/pull/50061#issuecomment-2680208684 Thank you for adding the JIRA issue ID, @yaooqinn .
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968711609 ## mllib/src/main/scala/org/apache/spark/ml/util/Summary.scala: ## @@ -18,11 +18,21 @@ package org.apache.spark.ml.util import org.apache.spark.annotation.Since +import org.apache.spark.util.KnownSizeEstimation /** + * For ml connect only. * Trait for the Summary * All the summaries should extend from this Summary in order to * support connect. */ @Since("4.0.0") -private[spark] trait Summary +private[spark] trait Summary extends KnownSizeEstimation { + + // A summary is normally a small object, with several RDDs or DataFrame. + // The SizeEstimator is likely to overestimate the size of the summary, + // because it will also count the underlying SparkSession and/or SparkContext, Review Comment: This is why I don't like the SizeEstimator.
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968717393 ## mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala: ## @@ -504,6 +506,10 @@ object Vectors { /** Max number of nonzero entries used in computing hash code. */ private[linalg] val MAX_HASH_NNZ = 128 + + private[ml] def getSparseSize(nnz: Long): Long = nnz * 12 + 20 Review Comment: Let's try to avoid hard coding constants.
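A hedged illustration of the reviewer's point, assuming the 12 bytes per nonzero come from a 4-byte Int index plus an 8-byte Double value (the fixed 20-byte overhead is an assumption carried over from the diff, not something confirmed by the PR):

```scala
// Sketch of a constant-free variant of the size helper; the overhead value is assumed.
object VectorSizeEstimate {
  val BytesPerInt: Long = java.lang.Integer.BYTES.toLong   // 4: one index per nonzero
  val BytesPerDouble: Long = java.lang.Double.BYTES.toLong // 8: one value per nonzero
  val SparseFixedOverheadBytes: Long = 20L                 // assumed object/array overhead

  def sparseSize(nnz: Long): Long =
    nnz * (BytesPerInt + BytesPerDouble) + SparseFixedOverheadBytes
}
```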
Re: [PR] [SPARK-51099][PYTHON][FOLLOWUP] Avoid logging when selector.select returns 0 without waiting the configured timeout [spark]
HyukjinKwon closed pull request #50071: [SPARK-51099][PYTHON][FOLLOWUP] Avoid logging when selector.select returns 0 without waiting the configured timeout URL: https://github.com/apache/spark/pull/50071
Re: [PR] [SPARK-51099][PYTHON][FOLLOWUP][4.0] Avoid logging when selector.select returns 0 without waiting the configured timeout [spark]
HyukjinKwon closed pull request #50072: [SPARK-51099][PYTHON][FOLLOWUP][4.0] Avoid logging when selector.select returns 0 without waiting the configured timeout URL: https://github.com/apache/spark/pull/50072
Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]
HyukjinKwon closed pull request #50058: [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job URL: https://github.com/apache/spark/pull/50058
Re: [PR] [SPARK-50914][PYTHON][CONNECT] Match GRPC dependencies for Python-only master scheduled job [spark]
HyukjinKwon commented on PR #50058: URL: https://github.com/apache/spark/pull/50058#issuecomment-2680012404 Merged to master and branch-4.0.
Re: [PR] [SPARK-50319] Reorder ResolveIdentifierClause and BindParameter rules [spark]
github-actions[bot] closed pull request #48849: [SPARK-50319] Reorder ResolveIdentifierClause and BindParameter rules URL: https://github.com/apache/spark/pull/48849
Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]
yaooqinn closed pull request #50061: [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql URL: https://github.com/apache/spark/pull/50061
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968701252 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala: ## @@ -313,4 +313,49 @@ object Connect { .internal() .booleanConf .createWithDefault(true) + + val CONNECT_SESSION_ML_CACHE_TIMEOUT = +buildConf("spark.connect.session.mlCache.timeout") Review Comment: I am not sure if we can do this. This is a non-recoverable situation for a client. The client itself should tell us they are done with a model. I would suggest instead that we either fail and give them a list of models they can evict, or that we store old models on disk.
Re: [PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]
yaooqinn closed pull request #50065: [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON URL: https://github.com/apache/spark/pull/50065
Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]
hvanhovell commented on code in PR #50013: URL: https://github.com/apache/spark/pull/50013#discussion_r1968703858 ## sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLHandler.scala: ## @@ -125,6 +127,15 @@ private[connect] object MLHandler extends Logging { val dataset = MLUtils.parseRelationProto(fitCmd.getDataset, sessionHolder) val estimator = MLUtils.getEstimator(sessionHolder, estimatorProto, Some(fitCmd.getParams)) + +// pre-training model size check +val maxSize = conf.getConf(Connect.CONNECT_SESSION_ML_CACHE_SINGLE_ITEM_SIZE) +if (maxSize > 0) { + val estimatedSize = estimator.estimateModelSize(dataset) + if (estimatedSize > maxSize) { +throw MlItemSizeExceededException("fit", estimator.uid, estimatedSize, maxSize) Review Comment: No we should not stop the session. The client can technically recover by increasing the limits.
Re: [PR] [SPARK-51306][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]
yaooqinn commented on PR #50061: URL: https://github.com/apache/spark/pull/50061#issuecomment-2680181236 Thank you @dongjoon-hyun @LuciferYang, SPARK-51306 is attached. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50692][SQL][FOLLOWUP] Add the LPAD and RPAD pushdown support for H2 [spark]
beliefer commented on PR #50068: URL: https://github.com/apache/spark/pull/50068#issuecomment-2680315941 @dongjoon-hyun Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]
beliefer commented on PR #50059: URL: https://github.com/apache/spark/pull/50059#issuecomment-2680328835 > Do you think you can add some test cases, @beliefer , to be clear what was the problem and to prevent a future regression? Spark Connect already has the test cases. This improvement just unifies the code path and improves maintainability. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]
hvanhovell commented on code in PR #49560: URL: https://github.com/apache/spark/pull/49560#discussion_r1968783156 ## sql/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -1031,6 +1031,26 @@ message GroupMap { // (Optional) The schema for the grouped state. optional DataType state_schema = 10; + + // Below fields are used by TransformWithState and TransformWithStateInPandas + // (Optional) TransformWithState related parameters. + optional TransformWithStateInfo transform_with_state_info = 11; +} + +// Additional input parameters used for TransformWithState operator. +message TransformWithStateInfo { + // (Required) Time mode string for transformWithState. + string time_mode = 1; Review Comment: I can't find that comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]
hvanhovell commented on code in PR #49560: URL: https://github.com/apache/spark/pull/49560#discussion_r1968784789 ## sql/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -1031,6 +1031,26 @@ message GroupMap { // (Optional) The schema for the grouped state. optional DataType state_schema = 10; + + // Below fields are used by TransformWithState and TransformWithStateInPandas + // (Optional) TransformWithState related parameters. + optional TransformWithStateInfo transform_with_state_info = 11; +} + +// Additional input parameters used for TransformWithState operator. +message TransformWithStateInfo { Review Comment: @haiyangsun-db you could make it compatible by retaining the original fields. The only problem is that now you have two codepaths in the planner to maintain. @jingz-db we only have to be backwards compatible (older client should work on a newer service). As long as we make additive changes we should be good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]
HyukjinKwon commented on PR #50056: URL: https://github.com/apache/spark/pull/50056#issuecomment-2677664899 Merged to master and branch-4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` [spark]
HyukjinKwon closed pull request #50056: [SPARK-51300][PS][DOCS] Fix broken link for `ps.sql` URL: https://github.com/apache/spark/pull/50056 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]
beliefer opened a new pull request, #50059: URL: https://github.com/apache/spark/pull/50059 ### What changes were proposed in this pull request? This PR proposes to unify the code path so that Spark Connect's jdbc support calls the DataFrameReader API. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
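For context, a short sketch of the two DataFrameReader JDBC entry points involved in this kind of unification; the H2 URL, table name, and user are placeholders.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Explicit jdbc() helper on DataFrameReader
val props = new Properties()
props.setProperty("user", "test")
val df1 = spark.read.jdbc("jdbc:h2:mem:testdb", "people", props)

// Generic format("jdbc") form, also handled by DataFrameReader
val df2 = spark.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")
  .option("dbtable", "people")
  .load()
```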
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
LuciferYang commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967181996 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compres
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
yaooqinn commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967210507 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compressio
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
LuciferYang commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967217142 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compres
[PR] Add rpad and lpad support for PostgreDialect and MsSQLServerDialect [spark]
milosstojanovic opened a new pull request, #50060: URL: https://github.com/apache/spark/pull/50060 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50856][SS][PYTHON][CONNECT] Spark Connect Support for TransformWithStateInPandas In Python [spark]
haiyangsun-db commented on code in PR #49560: URL: https://github.com/apache/spark/pull/49560#discussion_r1967297789 ## sql/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -1031,6 +1031,26 @@ message GroupMap { // (Optional) The schema for the grouped state. optional DataType state_schema = 10; + + // Below fields are used by TransformWithState and TransformWithStateInPandas + // (Optional) TransformWithState related parameters. + optional TransformWithStateInfo transform_with_state_info = 11; +} + +// Additional input parameters used for TransformWithState operator. +message TransformWithStateInfo { Review Comment: `GroupMap` may also be used for a normal "flatMapGroups" without state - so ideally we should have just one optional field in "GroupMap" such as: ``` oneof stateInfo { MapGroupsWithStateInfo = xx; TransformWithStateInfo = yy; } ``` And the stateInfo would clearly indicate three different types: 1. when it is none, it is a normal "flatMapGroups" 2. when it is MapGroupsWithStateInfo, it's for "MapGroupsWithState" or "FlatMapGroupsWithState" 3. otherwise, it's for transformWithState. This would look better, however, it is not backward compatible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][TESTS] Fix test errors caused by improper DROP TABLE/VIEW in describe.sql [spark]
yaooqinn opened a new pull request, #50061: URL: https://github.com/apache/spark/pull/50061 ### What changes were proposed in this pull request? This PR fixes test errors caused by improper DROP TABLE/VIEW in describe.sql - Table Not Found Error when dropping views after dropping the base table - Table Already Exists Error w/o proper drop ### Why are the changes needed? test fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? the existing tests ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add rpad and lpad support for PostgreDialect and MsSQLServerDialect [spark]
milosstojanovic closed pull request #50060: Add rpad and lpad support for PostgreDialect and MsSQLServerDialect URL: https://github.com/apache/spark/pull/50060 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]
itholic closed pull request #50038: [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` URL: https://github.com/apache/spark/pull/50038 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51278][PYTHON] Use appropriate structure of JSON format for `PySparkLogger` [spark]
itholic commented on PR #50038: URL: https://github.com/apache/spark/pull/50038#issuecomment-2677920582 Merged to master and branch-4.0. Thanks @ueshin @HyukjinKwon for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
beliefer commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967317270 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compressio
[PR] [SPARK-51304][DOCS][PYTHON] Use `getCondition` instead of `getErrorClass` in contribution guide [spark]
itholic opened a new pull request, #50062: URL: https://github.com/apache/spark/pull/50062 ### What changes were proposed in this pull request? This PR proposes to use `getCondition` instead of `getErrorClass` in the contribution guide ### Why are the changes needed? We deprecated `getErrorClass` ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing doc will be updated ### How was this patch tested? The existing doc build CI ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
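A minimal sketch of the replacement call on the JVM side, assuming `getCondition` is exposed on `SparkThrowable` the same way the guide describes it for the Python API; the helper name `conditionOf` is made up for illustration.

```scala
import org.apache.spark.SparkThrowable

// Sketch: read the error condition name from a thrown Spark error using
// getCondition() instead of the deprecated getErrorClass().
def conditionOf(t: Throwable): Option[String] = t match {
  case st: SparkThrowable => Option(st.getCondition)
  case _ => None
}
```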
Re: [PR] [DRAFT] Resolve default string producing expressions [spark]
stefankandic commented on code in PR #50053: URL: https://github.com/apache/spark/pull/50053#discussion_r1967363581 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala: ## @@ -28,6 +29,10 @@ import org.apache.spark.sql.types.{DataType, StringType} * collation from the corresponding object (table/view -> schema -> catalog). */ object ResolveDDLCommandStringTypes extends Rule[LogicalPlan] { + // Tag to mark expressions that have been cast to a new type so that we can + // avoid infinite recursion when resolving the same expression multiple times. + private val CAST_ADDED_TAG = new TreeNodeTag[Unit]("defaultStringExpressionCastAdded") Review Comment: Is there an alternative approach which doesn't lead to adding the cast infinitely? cc: @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
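For readers following the thread, a simplified sketch of the tag-marking pattern used to keep a rule idempotent; the tag name and helper below are illustrative and not the actual `ResolveDDLCommandStringTypes` code.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.types.StringType

// Illustrative tag; the real rule uses its own tag instance and target type.
val castAdded = new TreeNodeTag[Unit]("castAddedExample")

// Wrap an expression in a Cast at most once, even if the rule runs repeatedly.
def addCastOnce(e: Expression): Expression = {
  if (e.getTagValue(castAdded).isDefined) {
    e // already handled in an earlier pass
  } else {
    e.setTagValue(castAdded, ())
    Cast(e, StringType)
  }
}
```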
[PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]
HyukjinKwon opened a new pull request, #50063: URL: https://github.com/apache/spark/pull/50063 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/48643 that updates _minimum_googleapis_common_protos_version in setup.py for pyspark-client ### Why are the changes needed? To match the version with pyspark. ### Does this PR introduce _any_ user-facing change? No, `pyspark-client` has not been released yet. ### How was this patch tested? It will be tested in "Debug Build / Spark Connect Python-only (master, Python 3.11) " build. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]
HyukjinKwon opened a new pull request, #50064: URL: https://github.com/apache/spark/pull/50064 ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/48524 that updates _minimum_grpc_version in setup.py for pyspark-client ### Why are the changes needed? To match the version with pyspark. ### Does this PR introduce _any_ user-facing change? No, `pyspark-client` has not been released yet. ### How was this patch tested? It will be tested in "Debug Build / Spark Connect Python-only (master, Python 3.11) " build. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]
HyukjinKwon commented on PR #50063: URL: https://github.com/apache/spark/pull/50063#issuecomment-2678115467 Merged to master and branch-4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client [spark]
HyukjinKwon closed pull request #50063: [SPARK-50098][PYTHON][FOLLOW-UP] Update _minimum_googleapis_common_protos_version in setup.py for pyspark-client URL: https://github.com/apache/spark/pull/50063 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]
HyukjinKwon commented on PR #50064: URL: https://github.com/apache/spark/pull/50064#issuecomment-2678117982 Merged to master and branch-4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51302][CONNECT] Spark Connect supports JDBC should use the DataFrameReader API [spark]
beliefer commented on PR #50059: URL: https://github.com/apache/spark/pull/50059#issuecomment-2678117446 ping @HyukjinKwon @zhengruifeng @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-50795][SQL][FOLLOWUP] Set isParsing to false for the timestamp formatter in DESCRIBE AS JSON [spark]
yaooqinn opened a new pull request, #50065: URL: https://github.com/apache/spark/pull/50065 ### What changes were proposed in this pull request? This PR sets isParsing to false for the timestamp formatter in DESCRIBE AS JSON, because the formatter is not used for parsing datetime strings ### Why are the changes needed? Although it does not affect the final output, since the format we currently use has no 'S' (fraction-of-second) portion, it can prevent potential bugs if we store/display higher-precision timestamps. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client [spark]
HyukjinKwon closed pull request #50064: [SPARK-50015][PYTHON][FOLLOW-UP] Update _minimum_grpc_version in setup.py for pyspark-client URL: https://github.com/apache/spark/pull/50064 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]
dusantism-db commented on code in PR #50026: URL: https://github.com/apache/spark/pull/50026#discussion_r1967448405 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap( def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler } +object TriggerToExceptionHandlerMap { + def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap( Review Comment: This is the definition of a helper method which returns a new `TriggerToExceptionHandlerMap`. There is no state, we are using `def` not `val` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-50785][SQL] Refactor FOR statement to utilize local variables properly. [spark]
dusantism-db commented on code in PR #50026: URL: https://github.com/apache/spark/pull/50026#discussion_r1967452308 ## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala: ## @@ -206,6 +207,15 @@ class TriggerToExceptionHandlerMap( def getNotFoundHandler: Option[ExceptionHandlerExec] = notFoundHandler } +object TriggerToExceptionHandlerMap { + def empty: TriggerToExceptionHandlerMap = new TriggerToExceptionHandlerMap( Review Comment: Maybe it should have parenthesis for clarity - `empty()` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
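Since the thread turns on `def` versus `val`, here is a tiny standalone illustration (not from the PR) of why a `def` keeps no shared state in the companion object:

```scala
object Example {
  // A new, independent instance on every call: callers cannot observe each
  // other's mutations, so the object itself holds no shared state.
  def emptyBuffer: scala.collection.mutable.Map[String, Int] =
    scala.collection.mutable.Map.empty[String, Int]

  // A single cached instance: every caller shares (and can mutate) the same map.
  val sharedBuffer: scala.collection.mutable.Map[String, Int] =
    scala.collection.mutable.Map.empty[String, Int]
}
```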
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
yaooqinn commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967455912 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compressio
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967470537 ## core/benchmarks/ZStandardBenchmark-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool661 662 1 0.0 66080.2 1.0X -Compression 1 times at level 2 without buffer pool701 702 1 0.0 70111.0 0.9X -Compression 1 times at level 3 without buffer pool792 796 5 0.0 79224.8 0.8X -Compression 1 times at level 1 with buffer pool 573 573 0 0.0 57276.4 1.2X -Compression 1 times at level 2 with buffer pool 602 602 0 0.0 60206.9 1.1X -Compression 1 times at level 3 with buffer pool 707 707 1 0.0 70665.0 0.9X +Compression 1 times at level 1 without buffer pool266 267 2 0.0 26572.9 1.0X +Compression 1 times at level 2 without buffer pool691 692 1 0.0 69069.2 0.4X +Compression 1 times at level 3 without buffer pool792 802 10 0.0 79240.5 0.3X +Compression 1 times at level 1 with buffer pool 572 573 1 0.0 57238.7 0.5X +Compression 1 times at level 2 with buffer pool 604 605 2 0.0 60369.3 0.4X +Compression 1 times at level 3 with buffer pool 735 738 2 0.0 73544.0 0.4X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool635 635 0 0.0 63471.5 1.0X -Decompression 1 times from level 2 without buffer pool637 638 1 0.0 63693.2 1.0X -Decompression 1 times from level 3 without buffer pool637 638 1 0.0 63687.9 1.0X -Decompression 1 times from level 1 with buffer pool 545 545 0 0.0 54463.9 1.2X -Decompression 1 times from level 2 with buffer pool 544 545 1 0.0 54405.3 1.2X -Decompression 1 times from level 3 with buffer pool 544 545 1 0.0 54399.6 1.2X +Decompression 1 times from level 1 without buffer pool595 595 0 0.0 59493.7 1.0X +Decompression 1 times from level 2 without buffer pool596 596 1 0.0 59575.1 1.0X +Decompression 1 times from level 3 without buffer pool596 597 1 0.0 59607.5 1.0X +Decompression 1 times from level 1 with buffer pool 541 542 1 0.0 54117.6 1.1X +Decompression 1 times from level 2 with buffer pool 541 542 1 0.0 54088.7 1.1X +Decompression 1 times from level 3 with buffer pool 541 542 0 0.0 54106.0 1.1X -OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 17.0.14+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compression
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
Madhukar525722 commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2678157954 Hi @pan3793, while testing we are facing a warning: `[spark3-client]$ spark-sql --master yarn --deploy-mode client --driver-memory 4g --executor-memory 4g --conf spark.hadoop.hive.thrift.client.max.message.size=1gb` `Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/02/24 07:45:46 WARN HiveConf: HiveConf of name hive.thrift.client.max.message.size does not exist 25/02/24 07:45:49 WARN HiveConf: HiveConf of name hive.thrift.client.max.message.size does not exist` We have already defined the conf in hive-site.xml: `<property><name>hive.thrift.client.max.message.size</name><value>1gb</value></property>` So the error message for the table persists. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2678172344 @Madhukar525722 so it works? it's a warning message, not an error, and it seems reasonable to me. > 07:45:49 WARN HiveConf: HiveConf of name hive.thrift.client.max.message.size does not exist BTW, use three backticks to quote the code blocks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
Madhukar525722 commented on PR #50022: URL: https://github.com/apache/spark/pull/50022#issuecomment-2678181886 It didn't work, @pan3793. I suspect that the config setup didn't happen, which is why it resulted in the same old behaviour. Apart from that, the other error logs are still the same. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51305][CONNECT] Improve the code for createObservedMetricsResponse [spark]
beliefer opened a new pull request, #50066: URL: https://github.com/apache/spark/pull/50066 ### What changes were proposed in this pull request? This PR proposes to improve the code for `createObservedMetricsResponse`. ### Why are the changes needed? There is a duplicate check inside the loop; we should eliminate it. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
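As a generic illustration of the kind of cleanup described (not the actual `SparkConnectPlanExecution` code; all names below are assumed), a check that does not depend on the loop variable is resolved once before the loop instead of on every iteration.

```scala
final case class Metric(name: String, value: Long)

// Before: the Option was re-checked inside the loop for every metric.
// After: resolve it once, then build the response entries.
def render(metrics: Seq[Metric], schemaNames: Option[Seq[String]]): Seq[String] = {
  val names = schemaNames.getOrElse(Seq.empty)
  metrics.zipWithIndex.map { case (m, i) =>
    val label = if (i < names.length) names(i) else m.name
    s"$label=${m.value}"
  }
}
```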
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967508152 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala: ## @@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging { case _ => new HiveConf(conf, classOf[HiveConf]) } -try { +val hive = try { Hive.getWithoutRegisterFns(hiveConf) } catch { // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but - // 2.3.8 don't), therefore here we fallback when encountering the exception. + // 2.3.8 doesn't), therefore here we fallback when encountering the exception. case _: NoSuchMethodError => Hive.get(hiveConf) } + +// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size when +// `hive.thrift.client.max.message.size` is set and the value is positive +Option(hiveConf.get("hive.thrift.client.max.message.size")) + .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0) + .foreach { maxMessageSize => +logDebug(s"Trying to set metastore client thrift max message to $maxMessageSize") +configureMaxThriftMessageSize(hiveConf, hive.getMSC, maxMessageSize) + } + +hive + } + + private def getFieldValue[T](obj: Any, fieldName: String): T = { +val field = obj.getClass.getDeclaredField(fieldName) +field.setAccessible(true) +field.get(obj).asInstanceOf[T] + } + + private def findMethod(klass: Class[_], name: String, args: Class[_]*): Method = { +val method = klass.getDeclaredMethod(name, args: _*) +method.setAccessible(true) +method + } + + // SPARK-49489: a surgery for Hive 2.3.10 due to lack of HIVE-26633 + private def configureMaxThriftMessageSize( + hiveConf: HiveConf, msClient: IMetaStoreClient, maxMessageSize: Int): Unit = try { +msClient match { + // Hive uses Java Dynamic Proxy to enhance the MetaStoreClient to support synchronization + // and retrying, we should unwrap and access the real MetaStoreClient instance firstly + case proxy if JdkProxy.isProxyClass(proxy.getClass) => +JdkProxy.getInvocationHandler(proxy) match { + case syncHandler if syncHandler.getClass.getName.endsWith("SynchronizedHandler") => +val realMsc = getFieldValue[IMetaStoreClient](syncHandler, "client") +configureMaxThriftMessageSize(hiveConf, realMsc, maxMessageSize) + case retryHandler: RetryingMetaStoreClient => +val realMsc = getFieldValue[IMetaStoreClient](retryHandler, "base") +configureMaxThriftMessageSize(hiveConf, realMsc, maxMessageSize) + case _ => +} + case msc: HiveMetaStoreClient if !msc.isLocalMetaStore => +msc.getTTransport match { + case t: TEndpointTransport => +val currentMaxMessageSize = t.getConfiguration.getMaxMessageSize +if (currentMaxMessageSize != maxMessageSize) { + logDebug("Change the current metastore client thrift max message size from " + Review Comment: @Madhukar525722 could you please try the latest patch, and monitor if this log is printed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967508738 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala: ## @@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging { case _ => new HiveConf(conf, classOf[HiveConf]) } -try { +val hive = try { Hive.getWithoutRegisterFns(hiveConf) } catch { // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but - // 2.3.8 don't), therefore here we fallback when encountering the exception. + // 2.3.8 doesn't), therefore here we fallback when encountering the exception. case _: NoSuchMethodError => Hive.get(hiveConf) } + +// Follow behavior of HIVE-26633 (4.0.0), only apply the max message size when +// `hive.thrift.client.max.message.size` is set and the value is positive +Option(hiveConf.get("hive.thrift.client.max.message.size")) + .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0) + .foreach { maxMessageSize => +logDebug(s"Trying to set metastore client thrift max message to $maxMessageSize") Review Comment: @Madhukar525722 also this one -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
beliefer commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967516710 ## core/benchmarks/ZStandardBenchmark-jdk21-results.txt: ## @@ -2,48 +2,48 @@ Benchmark ZStandardCompressionCodec -OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Compression 1 times at level 1 without buffer pool656 668 12 0.0 65591.6 1.0X -Compression 1 times at level 2 without buffer pool709 711 2 0.0 70934.6 0.9X -Compression 1 times at level 3 without buffer pool814 818 5 0.0 81370.9 0.8X -Compression 1 times at level 1 with buffer pool 601 603 2 0.0 60100.1 1.1X -Compression 1 times at level 2 with buffer pool 634 636 2 0.0 63449.9 1.0X -Compression 1 times at level 3 with buffer pool 748 753 5 0.0 74789.7 0.9X +Compression 1 times at level 1 without buffer pool650 666 11 0.0 65044.6 1.0X +Compression 1 times at level 2 without buffer pool701 703 2 0.0 70071.2 0.9X +Compression 1 times at level 3 without buffer pool804 807 2 0.0 80430.6 0.8X +Compression 1 times at level 1 with buffer pool 598 599 1 0.0 59810.5 1.1X +Compression 1 times at level 2 with buffer pool 631 640 17 0.0 63096.7 1.0X +Compression 1 times at level 3 with buffer pool 751 753 4 0.0 75051.1 0.9X -OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Benchmark ZStandardCompressionCodec:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -- -Decompression 1 times from level 1 without buffer pool817 818 1 0.0 81723.2 1.0X -Decompression 1 times from level 2 without buffer pool817 818 1 0.0 81729.4 1.0X -Decompression 1 times from level 3 without buffer pool817 818 1 0.0 81719.7 1.0X -Decompression 1 times from level 1 with buffer pool 749 757 14 0.0 74864.9 1.1X -Decompression 1 times from level 2 with buffer pool 748 749 1 0.0 74789.1 1.1X -Decompression 1 times from level 3 with buffer pool 748 749 1 0.0 74811.6 1.1X +Decompression 1 times from level 1 without buffer pool823 831 14 0.0 82254.2 1.0X +Decompression 1 times from level 2 without buffer pool824 827 5 0.0 82359.1 1.0X +Decompression 1 times from level 3 without buffer pool819 821 2 0.0 81925.1 1.0X +Decompression 1 times from level 1 with buffer pool 753 755 3 0.0 75284.1 1.1X +Decompression 1 times from level 2 with buffer pool 753 755 1 0.0 75344.7 1.1X +Decompression 1 times from level 3 with buffer pool 753 754 1 0.0 75263.0 1.1X -OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1020-azure +OpenJDK 64-Bit Server VM 21.0.6+7-LTS on Linux 6.8.0-1021-azure AMD EPYC 7763 64-Core Processor Parallel Compression at level 3: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -Parallel Compressio
Re: [PR] [WIP][SQL] Add rpad and lpad support for PostgresDialect and MsSQLServerDialect expression pushdown [spark]
beliefer commented on PR #50060: URL: https://github.com/apache/spark/pull/50060#issuecomment-2678235174 Please create an issue to track this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-51078][SPARK-50963][ML][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add back tests for default value [spark]
zhengruifeng opened a new pull request, #50067: URL: https://github.com/apache/spark/pull/50067 ### What changes were proposed in this pull request? Add back the tests deleted in https://github.com/apache/spark/commit/e0a7db2d2a7d295f933f9fc2d5605c5e59c58aa7#diff-50e109673576cc6d4f8727f54076b0884b93b759c49d98c908581db7093cb5ab ### Why are the changes needed? For coverage; `StopWordsRemover`'s default value implementation is special ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
Madhukar525722 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967534275

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
## @@ -1407,13 +1408,74 @@ private[hive] object HiveClientImpl extends Logging {
       case _ => new HiveConf(conf, classOf[HiveConf])
     }
-    try {
+    val hive = try {
       Hive.getWithoutRegisterFns(hiveConf)
     } catch {
       // SPARK-37069: not all Hive versions have the above method (e.g., Hive 2.3.9 has it but
-      // 2.3.8 don't), therefore here we fallback when encountering the exception.
+      // 2.3.8 doesn't), therefore here we fallback when encountering the exception.
       case _: NoSuchMethodError => Hive.get(hiveConf)
     }
+
+    // Follow behavior of HIVE-26633 (4.0.0), only apply the max message size when
+    // `hive.thrift.client.max.message.size` is set and the value is positive
+    Option(hiveConf.get("hive.thrift.client.max.message.size"))
+      .map(HiveConf.toSizeBytes(_).toInt).filter(_ > 0)
+      .foreach { maxMessageSize =>
+        logDebug(s"Trying to set metastore client thrift max message to $maxMessageSize")

Review Comment:
   @pan3793, I have built with the latest patch. When I enabled the debug log and ran it, I can see
   ```25/02/24 12:11:30 DEBUG HiveClientImpl: Trying to set metastore client thrift max message to 1073741824```
   but the "Change the current metastore client thrift max message size from ..." log is not there.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49489][SQL][HIVE] HMS client respects `hive.thrift.client.maxmessage.size` [spark]
pan3793 commented on code in PR #50022: URL: https://github.com/apache/spark/pull/50022#discussion_r1967538536

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
+        logDebug(s"Trying to set metastore client thrift max message to $maxMessageSize")

Review Comment:
   @Madhukar525722 you may need to add some additional logs to debug what happened.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
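For anyone following the thread: the gating logic under discussion amounts to "parse the configured size, ignore non-positive values, then log before and after applying it". A minimal standalone Scala sketch of that shape, where `parseSizeBytes` and `currentMaxMessageSize` are illustrative stand-ins rather than the Hive APIs used in the PR:

```scala
object MaxMessageSizeGate extends App {
  // Illustrative stand-in for HiveConf.toSizeBytes: accepts plain bytes or a gb/mb suffix.
  def parseSizeBytes(v: String): Long = v.trim.toLowerCase match {
    case s if s.endsWith("gb") => s.dropRight(2).trim.toLong * 1024L * 1024L * 1024L
    case s if s.endsWith("mb") => s.dropRight(2).trim.toLong * 1024L * 1024L
    case s => s.toLong
  }

  var currentMaxMessageSize = 100 * 1024 * 1024 // placeholder for the client's existing limit
  val configured: Option[String] = Some("1gb")  // stands in for hiveConf.get(...)

  configured
    .map(parseSizeBytes(_).toInt)
    .filter(_ > 0)                              // only positive values take effect
    .foreach { maxMessageSize =>
      println(s"Trying to set metastore client thrift max message to $maxMessageSize")
      // The extra log suggested in the review: recording the old value before changing it
      // makes a missing "change" line point clearly at the branch that was not reached.
      println(s"Change the current metastore client thrift max message size from " +
        s"$currentMaxMessageSize to $maxMessageSize")
      currentMaxMessageSize = maxMessageSize
    }
}
```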
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967556026

## core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
beliefer commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967569627

## core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
Re: [PR] [SPARK-51301][BUILD] Bump zstd-jni 1.5.7-1 [spark]
pan3793 commented on code in PR #50057: URL: https://github.com/apache/spark/pull/50057#discussion_r1967578779

## core/benchmarks/ZStandardBenchmark-jdk21-results.txt:
Re: [PR] [SPARK-49912] Refactor simple CASE statement to evaluate the case variable only once [spark]
dusantism-db commented on code in PR #50027: URL: https://github.com/apache/spark/pull/50027#discussion_r1967595134

## sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingExecutionNode.scala:
## @@ -599,6 +599,116 @@ class CaseStatementExec(
   }
 }
+/**
+ * Executable node for SimpleCaseStatement.
+ * @param caseVariableExec Statement with which all conditionExpressions will be compared to.
+ * @param conditionExpressions Collection of expressions which correspond to WHEN clauses.
+ * @param conditionalBodies Collection of executable bodies that have a corresponding condition,
+ *   in WHEN branches.
+ * @param elseBody Body that is executed if none of the conditions are met, i.e. ELSE branch.
+ * @param session Spark session that SQL script is executed within.
+ * @param context SqlScriptingExecutionContext keeps the execution state of current script.
+ */
+class SimpleCaseStatementExec(
+    caseVariableExec: SingleStatementExec,
+    conditionExpressions: Seq[Expression],
+    conditionalBodies: Seq[CompoundBodyExec],
+    elseBody: Option[CompoundBodyExec],
+    session: SparkSession,
+    context: SqlScriptingExecutionContext) extends NonLeafStatementExec {
+  private object CaseState extends Enumeration {
+    val Condition, Body = Value
+  }
+
+  private var state = CaseState.Condition
+  var bodyExec: Option[CompoundBodyExec] = None
+
+  var conditionBodyTupleIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = _
+  private var caseVariableLiteral: Literal = _
+
+  private var isCacheValid = false
+  private def validateCache(): Unit = {
+    if (!isCacheValid) {
+      val values = caseVariableExec.buildDataFrame(session).collect()
+      caseVariableExec.isExecuted = true
+
+      caseVariableLiteral = Literal(values.head.get(0))
+      conditionBodyTupleIterator = createConditionBodyIterator
+      isCacheValid = true
+    }
+  }
+
+  private def cachedCaseVariableLiteral: Literal = {
+    validateCache()
+    caseVariableLiteral
+  }
+
+  private def cachedConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] = {
+    validateCache()
+    conditionBodyTupleIterator
+  }
+
+  private lazy val treeIterator: Iterator[CompoundStatementExec] =
+    new Iterator[CompoundStatementExec] {
+      override def hasNext: Boolean = state match {
+        case CaseState.Condition => cachedConditionBodyIterator.hasNext || elseBody.isDefined
+        case CaseState.Body => bodyExec.exists(_.getTreeIterator.hasNext)
+      }
+
+      override def next(): CompoundStatementExec = state match {
+        case CaseState.Condition =>
+          cachedConditionBodyIterator.nextOption()
+            .map { case (condStmt, body) =>
+              if (evaluateBooleanCondition(session, condStmt)) {
+                bodyExec = Some(body)
+                state = CaseState.Body
+              }
+              condStmt
+            }
+            .orElse(elseBody.map { body => {
+              bodyExec = Some(body)
+              state = CaseState.Body
+              next()
+            }})
+            .get
+        case CaseState.Body => bodyExec.get.getTreeIterator.next()
+      }
+    }
+
+  private def createConditionBodyIterator: Iterator[(SingleStatementExec, CompoundBodyExec)] =
+    conditionExpressions.zip(conditionalBodies)
+      .iterator
+      .map { case (expr, body) =>
+        val condition = Project(
+          Seq(Alias(EqualTo(cachedCaseVariableLiteral, expr), "condition")()),
+          OneRowRelation()
+        )
+        // We hack the Origin to provide more descriptive error messages. For example, if
+        // the case variable is 1 and the condition expression it's compared to is 5, we
+        // will get Origin with text "(1 = 5)".
+        val conditionText = condition.projectList.head.asInstanceOf[Alias].child.toString
+        val condStmt = new SingleStatementExec(
+          condition,
+          Origin(sqlText = Some(conditionText),
+            startIndex = Some(0),
+            stopIndex = Some(conditionText.length - 1)),

Review Comment:
   I added the line propagation - https://github.com/apache/spark/pull/50027/commits/d06ab7310a1dd2fd1084943194d78fe3900845d6

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
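The core of the refactor quoted above is "evaluate the case variable once, then reuse the cached literal for every WHEN comparison", which is what `validateCache()` implements. A stripped-down Scala sketch of that validate-once caching pattern, independent of the Spark classes involved (all names here are illustrative):

```scala
object EvaluateOnce extends App {
  var evaluations = 0

  // Stands in for running the case-variable query; must be executed at most once.
  def evaluateCaseVariable(): Int = { evaluations += 1; 42 }

  private var cached: Option[Int] = None
  private def cachedCaseVariable: Int = {
    if (cached.isEmpty) cached = Some(evaluateCaseVariable()) // validateCache() equivalent
    cached.get
  }

  // Each WHEN branch compares against the cached value instead of re-running the query.
  val whenValues = Seq(1, 7, 42, 100)
  val matched = whenValues.find(_ == cachedCaseVariable)

  println(s"matched branch: $matched, case variable evaluated $evaluations time(s)")
  assert(evaluations == 1)
}
```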
Re: [PR] [SPARK-51290][SQL] Enable filling default values in DSv2 writes [spark]
cloud-fan commented on code in PR #50044: URL: https://github.com/apache/spark/pull/50044#discussion_r1967618587

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
## @@ -3534,7 +3534,8 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         TableOutputResolver.suitableForByNameCheck(v2Write.isByName,
           expected = v2Write.table.output, queryOutput = v2Write.query.output)
         val projection = TableOutputResolver.resolveOutputColumns(
-          v2Write.table.name, v2Write.table.output, v2Write.query, v2Write.isByName, conf)
+          v2Write.table.name, v2Write.table.output, v2Write.query, v2Write.isByName, conf,
+          supportColDefaultValue = true)

Review Comment:
   Yea true, Spark fills the default values during table writing and it works for all catalogs.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
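To make the user-visible effect concrete: "filling default values during table writing" means an INSERT that omits a defaulted column gets the declared default rather than NULL or an error. A hedged Scala sketch of that behavior, assuming the configured catalog/data source accepts column DEFAULT metadata and that default-column support is enabled (which path performs the fill depends on the catalog):

```scala
import org.apache.spark.sql.SparkSession

object DefaultFillExample extends App {
  val spark = SparkSession.builder().master("local[1]").appName("default-fill").getOrCreate()

  // Table with a defaulted column; the provider and DEFAULT support are assumptions here.
  spark.sql("CREATE TABLE t (id BIGINT, status STRING DEFAULT 'new') USING parquet")

  // The column list omits `status`; the write path is expected to fill in 'new'.
  spark.sql("INSERT INTO t (id) VALUES (1), (2)")
  spark.sql("SELECT * FROM t").show() // expected: status = 'new' for both rows

  spark.stop()
}
```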