Re: [PR] [SPARK-51443][SS] Fix singleVariantColumn in DSv2 and readStream. [spark]

2025-03-13 Thread via GitHub


cloud-fan closed pull request #50217: [SPARK-51443][SS] Fix singleVariantColumn 
in DSv2 and readStream.
URL: https://github.com/apache/spark/pull/50217


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51402][SQL][TESTS] Test TimeType in UDF [spark]

2025-03-13 Thread via GitHub


MaxGekk commented on PR #50194:
URL: https://github.com/apache/spark/pull/50194#issuecomment-2720623643

   +1, LGTM. Merging to master.
   Thank you, @calilisantos.





Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993729490


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/Constraint.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+import org.apache.spark.sql.connector.expressions.filter.Predicate;
+
+/**
+ * A constraint that defines valid states of data in a table.
+ *
+ * @since 4.1.0
+ */
+@Evolving
+public interface Constraint {
+  /**
+   * Returns the name of this constraint.
+   */
+  String name();
+
+  /**
+   * Returns the definition of this constraint in DDL format.
+   */
+  String toDDL();
+
+  /**
+   * Returns the state of this constraint.
+   */
+  ConstraintState state();
+
+  /**
+   * Creates a CHECK constraint with a search condition defined in SQL (Spark 
SQL dialect).
+   *
+   * @param name the constraint name
+   * @param sql the SQL representation of the search condition (Spark SQL 
dialect)
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, String sql, ConstraintState state) {
+return new Check(name, sql, null /* no predicate */, state);

Review Comment:
   Shall we use Java `Optional.empty()` instead of `null`? IIRC, this `null` was not a part of the SPIP, was it?
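
   For illustration, a minimal Scala sketch of the suggestion (generic stand-ins, not the actual `Check`/`ConstraintState` classes from this PR): model the absent SQL text or predicate as `Optional.empty()` instead of `null`.
   ```scala
   import java.util.Optional

   // Generic illustration only: these are stand-ins for the real Check/Predicate types.
   final case class CheckLike(
       name: String,
       sql: Optional[String],        // stand-in for the SQL search condition
       predicate: Optional[AnyRef])  // stand-in for Predicate

   object CheckLike {
     def fromSql(name: String, sql: String): CheckLike =
       CheckLike(name, Optional.of(sql), Optional.empty()) // no predicate
     def fromPredicate(name: String, predicate: AnyRef): CheckLike =
       CheckLike(name, Optional.empty(), Optional.of(predicate)) // no SQL
   }
   ```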






Re: [PR] [SPARK-51372][SQL] Introduce a builder pattern in TableCatalog [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on PR #50137:
URL: https://github.com/apache/spark/pull/50137#issuecomment-2721388429

   OK so the requirement is:
   - adding a new item to table metadata should not require a new overload of 
`def createTable`
   - the existing `def createTable` implementation should fail if a new table 
feature is added, but should still work if the new feature is not used
   
   I think the builder should not be an interface, but a class that can have 
member variables and methods:
   ```
   class CreateTableBuilder {
     protected Column[] columns;
     ...
     CreateTableBuilder withColumns ...
     ...
     Table create ...
   }
   ```
   
   Then we add a new method in `TableCatalog` to create a builder, with the 
default implementation to return the builtin builder.
   
   The workflow is Spark getting the builder from `TableCatalog`, calling 
`withXXX` methods to set up fields, and calling `create` at the end.
   
   When we add a new item (e.g. constraints), we add a new `withXXX` method and 
`supportsXXX` method in the builder. By default `supportsXXX` returns false, 
and `withXXX` throws an exception if `supportsXXX` is false. Users need to 
explicitly override `supportsXXX` to return true to adopt the new feature.
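
   A rough, self-contained Scala sketch of this builder idea (names such as `supportsConstraints` are illustrative assumptions, not an actual Spark API):
   ```scala
   // Stand-in types for the sketch; the real API would use the DSv2 Column/Constraint classes.
   case class Column(name: String, dataType: String)
   case class Constraint(ddl: String)

   class CreateTableBuilder {
     protected var columns: Seq[Column] = Nil
     protected var constraints: Seq[Constraint] = Nil

     def withColumns(cols: Seq[Column]): this.type = { columns = cols; this }

     // Added later, together with the (hypothetical) constraints feature.
     def supportsConstraints: Boolean = false // catalogs must opt in explicitly

     def withConstraints(cs: Seq[Constraint]): this.type = {
       if (!supportsConstraints) {
         throw new UnsupportedOperationException("This catalog does not support constraints")
       }
       constraints = cs
       this
     }

     def create(): String =
       s"table with ${columns.size} columns and ${constraints.size} constraints"
   }

   // A catalog that adopts the new feature overrides supportsConstraints.
   class ConstraintAwareBuilder extends CreateTableBuilder {
     override def supportsConstraints: Boolean = true
   }

   // Workflow: Spark obtains the builder from the catalog, calls withXXX, then create().
   val table = new ConstraintAwareBuilder()
     .withColumns(Seq(Column("id", "bigint")))
     .withConstraints(Seq(Constraint("CHECK (id > 0)")))
     .create()
   ```
   The point of the `supportsXXX` guard is that an existing catalog implementation keeps compiling and working until a query actually uses the new feature, at which point it fails loudly instead of silently dropping metadata.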





Re: [PR] [SPARK-51438][SQL] Make CatalystDataToProtobuf and ProtobufDataToCatalyst properly comparable and hashable [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on PR #50212:
URL: https://github.com/apache/spark/pull/50212#issuecomment-2721282652

   thanks, merging to master!





Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2721458327

   To @zhengruifeng, could you confirm this part: https://github.com/apache/spark/pull/50255#pullrequestreview-2679800649?
   > this PR seems to affect branch-4.0 Daily CI. Could you confirm that? I'm 
wondering if this PR is targeting 4.0.0 although we don't change spark-rm yet.





Re: [PR] [SPARK-51438][SQL] Make CatalystDataToProtobuf and ProtobufDataToCatalyst properly comparable and hashable [spark]

2025-03-13 Thread via GitHub


cloud-fan closed pull request #50212: [SPARK-51438][SQL] Make 
CatalystDataToProtobuf and ProtobufDataToCatalyst properly comparable and 
hashable
URL: https://github.com/apache/spark/pull/50212





Re: [PR] [SPARK-51338][INFRA] Add automated CI build for `connect-examples` [spark]

2025-03-13 Thread via GitHub


LuciferYang commented on code in PR #50187:
URL: https://github.com/apache/spark/pull/50187#discussion_r1991901327


##
connect-examples/server-library-example/pom.xml:
##
@@ -36,7 +36,8 @@
 UTF-8
 2.13
 2.13.15
-3.25.4
-4.0.0-preview2
+4.29.3
+4.1.0-SNAPSHOT
+33.4.0-jre

Review Comment:
   If feasible, it's certainly ok. However, I have a few questions regarding 
this: 
   
   1. How should the versions of other dependencies be updated? Do they need to 
be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, 
but this project is still using 2.13.15. 
   
   2. During the release process, after changing the Spark version (e.g., from 
4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the `build` of this project? 
   
   3. Since it aims to be an independent project, why don't we maintain this examples project in a separate branch (no Spark code whatsoever), or even create a separate repository like `spark-connect-examples`? If it were an independent repository, would it also be more convenient to include examples for clients in other programming languages, such as Go or Swift?






Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993722344


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/BaseConstraint.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.sql.connector.expressions.NamedReference;
+
+import java.util.StringJoiner;

Review Comment:
   Could you place this before `org.apache.spark.` import statements?






Re: [PR] [SPARK-51359][CORE][SQL] Set INT64 as the default timestamp type for Parquet files [spark]

2025-03-13 Thread via GitHub


ganeshashree commented on PR #50215:
URL: https://github.com/apache/spark/pull/50215#issuecomment-2721580059

   > int96 support in parquet apache/iceberg#1138
   
   @HyukjinKwon Yes, this will be a breaking change for applications that depend on INT96 and move to future versions of Spark: we are changing the default timestamp type for Parquet files. Applications that read or write Parquet files will need to configure Spark to handle INT96, as it will no longer be the default. We will need to document this change and highlight it in the release notes.
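
   A hedged sketch of what that opt-in might look like on the write path (`spark.sql.parquet.outputTimestampType` is an existing config with values INT96 / TIMESTAMP_MICROS / TIMESTAMP_MILLIS; whether it remains the relevant switch after this change is an assumption):
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("legacy-int96-parquet")
     .master("local[*]")
     // keep writing legacy INT96 timestamps instead of the new INT64 default
     .config("spark.sql.parquet.outputTimestampType", "INT96")
     .getOrCreate()

   spark.range(1)
     .selectExpr("current_timestamp() AS ts")
     .write.mode("overwrite")
     .parquet("/tmp/int96_example")

   spark.stop()
   ```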
   





Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on PR #50232:
URL: https://github.com/apache/spark/pull/50232#issuecomment-2721567771

   late LGTM





Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993729490


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/Constraint.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+import org.apache.spark.sql.connector.expressions.filter.Predicate;
+
+/**
+ * A constraint that defines valid states of data in a table.
+ *
+ * @since 4.1.0
+ */
+@Evolving
+public interface Constraint {
+  /**
+   * Returns the name of this constraint.
+   */
+  String name();
+
+  /**
+   * Returns the definition of this constraint in DDL format.
+   */
+  String toDDL();
+
+  /**
+   * Returns the state of this constraint.
+   */
+  ConstraintState state();
+
+  /**
+   * Creates a CHECK constraint with a search condition defined in SQL (Spark 
SQL dialect).
+   *
+   * @param name the constraint name
+   * @param sql the SQL representation of the search condition (Spark SQL 
dialect)
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, String sql, ConstraintState state) {
+return new Check(name, sql, null /* no predicate */, state);

Review Comment:
   Shall we use Java `Optional.empty()` instead of `null`?






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993730085


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ProcedureCatalog.java:
##
@@ -34,4 +34,9 @@ public interface ProcedureCatalog extends CatalogPlugin {
* @return the loaded unbound procedure
*/
   UnboundProcedure loadProcedure(Identifier ident);
+
+  /**
+   * List all procedures in the specified database.
+   */
+  Identifier[] listProcedures(String[] namespace);

Review Comment:
   I think for methods that need to be implemented by users, `String[]` is 
better.






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993728678


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ProcedureCatalog.java:
##
@@ -34,4 +34,9 @@ public interface ProcedureCatalog extends CatalogPlugin {
* @return the loaded unbound procedure
*/
   UnboundProcedure loadProcedure(Identifier ident);
+
+  /**
+   * List all procedures in the specified database.

Review Comment:
   +1






Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993731273


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/Constraint.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+import org.apache.spark.sql.connector.expressions.filter.Predicate;
+
+/**
+ * A constraint that defines valid states of data in a table.
+ *
+ * @since 4.1.0
+ */
+@Evolving
+public interface Constraint {
+  /**
+   * Returns the name of this constraint.
+   */
+  String name();
+
+  /**
+   * Returns the definition of this constraint in DDL format.
+   */
+  String toDDL();
+
+  /**
+   * Returns the state of this constraint.
+   */
+  ConstraintState state();
+
+  /**
+   * Creates a CHECK constraint with a search condition defined in SQL (Spark 
SQL dialect).
+   *
+   * @param name the constraint name
+   * @param sql the SQL representation of the search condition (Spark SQL 
dialect)
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, String sql, ConstraintState state) {
+return new Check(name, sql, null /* no predicate */, state);
+  }
+
+  /**
+   * Creates a CHECK constraint with a search condition defined by a {@link 
Predicate predicate}.
+   *
+   * @param name the constraint name
+   * @param predicate the search condition
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, Predicate predicate, ConstraintState state) {
+return new Check(name, null /* no SQL */, predicate, state);

Review Comment:
   ditto. `Optional`?






Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993729490


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/Constraint.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+import org.apache.spark.sql.connector.expressions.filter.Predicate;
+
+/**
+ * A constraint that defines valid states of data in a table.
+ *
+ * @since 4.1.0
+ */
+@Evolving
+public interface Constraint {
+  /**
+   * Returns the name of this constraint.
+   */
+  String name();
+
+  /**
+   * Returns the definition of this constraint in DDL format.
+   */
+  String toDDL();
+
+  /**
+   * Returns the state of this constraint.
+   */
+  ConstraintState state();
+
+  /**
+   * Creates a CHECK constraint with a search condition defined in SQL (Spark 
SQL dialect).
+   *
+   * @param name the constraint name
+   * @param sql the SQL representation of the search condition (Spark SQL 
dialect)
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, String sql, ConstraintState state) {
+return new Check(name, sql, null /* no predicate */, state);

Review Comment:
   Shall we use Java `Optional.empty()` instead of `null`? IIRC, this `null` was not a part of the SPIP, was it? Please correct me if there is a reason to have `null`.






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993733934


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -268,6 +268,11 @@ class InMemoryTableCatalog extends 
BasicInMemoryTableCatalog with SupportsNamesp
 procedure
   }
 
+  override def listProcedures(namespace: Array[String]): Array[Identifier] = {
+procedures.asScala.filter{case (_, p) => 
!p.name().equals("dummy_increment")}

Review Comment:
   oh nvm, I misread the code.
   
   BTW it's Scala, so we can do `case (_, p) if p.name != "..."`






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993734437


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -268,6 +268,11 @@ class InMemoryTableCatalog extends 
BasicInMemoryTableCatalog with SupportsNamesp
 procedure
   }
 
+  override def listProcedures(namespace: Array[String]): Array[Identifier] = {

Review Comment:
   the `namespace` parameter is not used?






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993737628


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ShowProceduresCommand.scala:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
+import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.types.StringType
+
+/**
+ * A command for users to get procedures.
+ * If a namespace is not given, the current namespace will be used.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW TABLES [(IN|FROM) namespace]]
+ * }}}
+ */
+case class ShowProceduresCommand(
+child: LogicalPlan,
+override val output: Seq[Attribute] = Seq(
+  AttributeReference("namespace", StringType, nullable = false)(),

Review Comment:
   Shall we follow SHOW FUNCTIONS and only output qualified procedure names?






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993740184


##
sql/core/src/test/scala/org/apache/spark/sql/connector/ProcedureSuite.scala:
##
@@ -40,15 +40,23 @@ class ProcedureSuite extends QueryTest with 
SharedSparkSession with BeforeAndAft
 
   before {
 spark.conf.set(s"spark.sql.catalog.cat", classOf[InMemoryCatalog].getName)
+spark.conf.set(s"spark.sql.catalog.cat2", classOf[InMemoryCatalog].getName)
+
+// needed for switching back and forth between catalogs
+sql("create database cat.default")

Review Comment:
   The default namespace for these catalogs should be `[]`.






Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993729490


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/Constraint.java:
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+import org.apache.spark.sql.connector.expressions.filter.Predicate;
+
+/**
+ * A constraint that defines valid states of data in a table.
+ *
+ * @since 4.1.0
+ */
+@Evolving
+public interface Constraint {
+  /**
+   * Returns the name of this constraint.
+   */
+  String name();
+
+  /**
+   * Returns the definition of this constraint in DDL format.
+   */
+  String toDDL();
+
+  /**
+   * Returns the state of this constraint.
+   */
+  ConstraintState state();
+
+  /**
+   * Creates a CHECK constraint with a search condition defined in SQL (Spark 
SQL dialect).
+   *
+   * @param name the constraint name
+   * @param sql the SQL representation of the search condition (Spark SQL 
dialect)
+   * @param state the constraint state
+   * @return a CHECK constraint with the provided configuration
+   */
+  static Check check(String name, String sql, ConstraintState state) {
+return new Check(name, sql, null /* no predicate */, state);

Review Comment:
   Shall we use Java `Optional.empty()` instead of `null`? IIRC, this `null` was not a part of the SPIP, was it? Please correct me if there is a reason to have `null`. I might have missed the detail.






Re: [PR] [SPARK-48922][SQL] Avoid redundant array transform of identical expression for map type [spark]

2025-03-13 Thread via GitHub


wForget commented on PR #50245:
URL: https://github.com/apache/spark/pull/50245#issuecomment-2719721278

   > @wForget Could you create backport PR for branch-3.5
   
   Sure, I will create one later





Re: [PR] [SPARK-51495] Add `Integration Test` GitHub Action job [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #15:
URL: 
https://github.com/apache/spark-connect-swift/pull/15#discussion_r1992468710


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -81,6 +81,7 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+#if !os(Linux)

Review Comment:
   This is ignored in the `Linux` environment due to a `Swift` library difference. It will be re-enabled later.






Re: [PR] [SPARK-37019][SQL] Add codegen support to array higher-order functions [spark]

2025-03-13 Thread via GitHub


chris-twiner commented on code in PR #34558:
URL: https://github.com/apache/spark/pull/34558#discussion_r1993925656


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala:
##
@@ -235,6 +256,53 @@ trait HigherOrderFunction extends Expression with 
ExpectsInputTypes {
 val canonicalizedChildren = cleaned.children.map(_.canonicalized)
 withNewChildren(canonicalizedChildren)
   }
+
+
+  protected def assignAtomic(atomicRef: String, value: String, isNull: String 
= FalseLiteral,
+  nullable: Boolean = false) = {
+if (nullable) {
+  s"""
+if ($isNull) {
+  $atomicRef.set(null);
+} else {
+  $atomicRef.set($value);
+}
+  """
+} else {
+  s"$atomicRef.set($value);"
+}
+  }
+
+  protected def assignArrayElement(ctx: CodegenContext, arrayName: String, 
elementCode: ExprCode,
+  elementVar: NamedLambdaVariable, index: String): String = {
+val elementType = elementVar.dataType
+val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value)
+val extractElement = CodeGenerator.getValue(arrayName, elementType, index)

Review Comment:
   > Finally got back around to this, added tests showing why the atomic 
reference is needed in case of `CodegenFallback` inside a lambda expression
   
   FYI - In [my 
testing](https://github.com/sparkutils/quality/blob/temp/fabric_0.1.3.1_rc6_map_with_issues/src/main/scala/org/apache/spark/sql/qualityFunctions/LambdaCompilation.scala#L21)
 you can seemingly, at codegen time, swap out the 
AtomicReference/NamedLambdaVariable in all cases to a simple set/get wrapper 
around a field for the eval CodegenFallback case.
   
   Per [the original NamedLambdaVariable PR](https://github.com/apache/spark/pull/21954#discussion_r207162167), the AtomicReference only seems to be needed for proper tree transformations, so you can swap it out at compilation time to squeeze out a little [more perf](https://brooker.co.za/blog/2012/09/10/volatile.html) (lower memory usage, no volatile reads or memory barriers on writes, etc.). This only really makes a difference if the code is HOF-heavy, of course, as Quality tends to be.
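
   A minimal Scala sketch of that trade-off (a hypothetical holder, not the Quality or Spark implementation): NamedLambdaVariable keeps its current value in an AtomicReference (volatile semantics), while a plain holder with the same get/set surface skips the volatile read/write when the value is only touched from a single thread.
   ```scala
   import java.util.concurrent.atomic.AtomicReference

   final class PlainValueHolder[T](private var v: T) {
     def get(): T = v                        // plain field read, no memory barrier
     def set(value: T): Unit = { v = value } // plain field write
   }

   val atomic = new AtomicReference[Int](0)
   val plain = new PlainValueHolder[Int](0)
   atomic.set(42)
   plain.set(42)
   assert(atomic.get() == plain.get())
   ```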






Re: [PR] [SPARK-51350][SQL] Implement Show Procedures [spark]

2025-03-13 Thread via GitHub


szehon-ho commented on code in PR #50109:
URL: https://github.com/apache/spark/pull/50109#discussion_r1993927792


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -268,6 +268,11 @@ class InMemoryTableCatalog extends 
BasicInMemoryTableCatalog with SupportsNamesp
 procedure
   }
 
+  override def listProcedures(namespace: Array[String]): Array[Identifier] = {
+procedures.asScala.filter{case (_, p) => 
!p.name().equals("dummy_increment")}

Review Comment:
   this is a filter, so it has to return a predicate. Do you mean `.filter{case (_, p) if !p.name().equals("dummy_increment") => true}`? It looks a bit weird.
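
   A small self-contained Scala sketch of the two spellings being discussed (plain Scala `Map` here, instead of the Java map reached via `.asScala` in the test catalog):
   ```scala
   val procedures = Map("a" -> "dummy_increment", "b" -> "sum_rows")

   // filter needs a total predicate, so a boolean body is the usual spelling:
   val kept1 = procedures.filter { case (_, p) => p != "dummy_increment" }

   // a guard-only case compiles but throws a MatchError for filtered-out entries,
   // so with a pattern guard one would reach for collect and return the entry:
   val kept2 = procedures.collect { case entry @ (_, p) if p != "dummy_increment" => entry }

   assert(kept1 == kept2)
   ```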
   






Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#issuecomment-272244

   Could you review this PR when you have some time, @viirya?





Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


viirya commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994364613


##
Sources/SparkConnect/DataFrame.swift:
##
@@ -192,4 +192,38 @@ public actor DataFrame: Sendable {
   print(table.render())
 }
   }
+
+  /// Projects a set of expressions and returns a new ``DataFrame``.
+  /// - Parameter cols: Column names
+  /// - Returns: A ``DataFrame`` with subset of columns.
+  public func select(_ cols: String...) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getProject(self.plan.root, cols))
+  }
+
+  /// Return a new ``DataFrame`` sorted by the specified column(s).
+  /// - Parameter cols: Column names.
+  /// - Returns: A sorted ``DataFrame``
+  public func sort(_ cols: String...) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getSort(self.plan.root, cols))
+  }
+
+  /// <#Description#>
+  /// - Parameter cols: <#cols description#>
+  /// - Returns: <#description#>

Review Comment:
   The API doc looks incorrect?






Re: [PR] [SPARK-44856][PYTHON] Improve Python UDTF arrow serializer performance [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #50099:
URL: https://github.com/apache/spark/pull/50099#issuecomment-2721318301

   Could you fix the remaining failures?
   ```
   pyspark.errors.exceptions.base.PySparkRuntimeError: 
[UDTF_ARROW_TYPE_CAST_ERROR] Cannot convert the output value of the column 'x' 
with type 'object' to the specified return type of the column: 'list'. Please check if the data types match and try again.
   ```





Re: [PR] [SPARK-43221][CORE][3.5] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun closed pull request #50260: [SPARK-43221][CORE][3.5] Host local 
block fetching should use a block status of a block stored on disk
URL: https://github.com/apache/spark/pull/50260





Re: [PR] [SPARK-51450][CORE] BarrierCoordinator thread not exiting in Spark standalone mode [spark]

2025-03-13 Thread via GitHub


beliefer commented on PR #50223:
URL: https://github.com/apache/spark/pull/50223#issuecomment-2721323484

   @jjayadeep06 @srowen @cnauroth Thank you!
   Merged into branch-3.5.





Re: [PR] [SPARK-51450][CORE] BarrierCoordinator thread not exiting in Spark standalone mode [spark]

2025-03-13 Thread via GitHub


beliefer closed pull request #50223: [SPARK-51450][CORE] BarrierCoordinator 
thread not exiting in Spark standalone mode
URL: https://github.com/apache/spark/pull/50223





Re: [PR] [SPARK-43221][CORE][3.5] Host local block fetching should use a block status of a block stored on disk [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #50260:
URL: https://github.com/apache/spark/pull/50260#issuecomment-2721308974

   Merged to branch-3.5.





Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-13 Thread via GitHub


zhengruifeng commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2721794721

   > To @zhengruifeng , could you confirm this part [#50255 
(review)](https://github.com/apache/spark/pull/50255#pullrequestreview-2679800649)
 ?
   > 
   > > this PR seems to affect branch-4.0 Daily CI. Could you confirm that? I'm 
wondering if this PR is targeting 4.0.0 although we don't change spark-rm yet.
   
   For the branch-4.0 Daily CI, I checked it with https://github.com/apache/spark/pull/50262; I think branch-4.0 is compatible with pyarrow 19.
   
   And the docker image tags are different for branch-4.0 and master.
   
   Build / Python-only (branch-4.0):
   >  /usr/bin/docker --config 
/home/runner/work/_temp/.docker_b5b08f1c-8088-4c21-8300-3944aeeddde3 pull 
ghcr.io/apache/apache-spark-ci-image-pyspark-python-311:branch-4.0-13834271006
   
   Build / Coverage (master, Scala 2.13, Hadoop 3, JDK 17):
   >   /usr/bin/docker --config 
/home/runner/work/_temp/.docker_5b8ff392-d367-4a9b-92c4-3f068331fdee pull 
ghcr.io/apache/apache-spark-ci-image-pyspark-python-311:master-13831993775
   
   So I suspect that if we only merge this PR into master, it won't affect branch-4.0.
   
   Regarding the other daily builds, can we verify them later? Otherwise, we would need to change the env of the PR builder for each of these jobs and test them one by one.





Re: [PR] [SPARK-23890][SQL] Support DDL for adding nested columns to struct types [spark]

2025-03-13 Thread via GitHub


ottomata commented on PR #21012:
URL: https://github.com/apache/spark/pull/21012#issuecomment-2722408376

   > Would one need to use CHANGE|ALTER COLUMN syntax for this?
   
   [TIL](https://phabricator.wikimedia.org/T209453#10632894) that Iceberg 
supports this with .value column name referencing!
   
   https://iceberg.apache.org/docs/nightly/spark-ddl/#alter-table-add-column
   
   ```sql
   -- create a map column of struct key and struct value
   ALTER TABLE prod.db.sample
   ADD COLUMN points map<struct<x: double>, struct<a: int>>;
   
   -- add a field to the value struct in a map. Using keyword 'value' to access the map's value column.
   ALTER TABLE prod.db.sample
   ADD COLUMN points.value.b int;
   ```
   
   And, that works via Spark!  
   
   I guess there is a bug though:
   https://github.com/apache/iceberg/issues/2962





Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


viirya commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994366213


##
Sources/SparkConnect/DataFrame.swift:
##
@@ -192,4 +192,38 @@ public actor DataFrame: Sendable {
   print(table.render())
 }
   }
+
+  /// Projects a set of expressions and returns a new ``DataFrame``.
+  /// - Parameter cols: Column names
+  /// - Returns: A ``DataFrame`` with subset of columns.
+  public func select(_ cols: String...) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getProject(self.plan.root, cols))
+  }
+
+  /// Return a new ``DataFrame`` sorted by the specified column(s).
+  /// - Parameter cols: Column names.
+  /// - Returns: A sorted ``DataFrame``
+  public func sort(_ cols: String...) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getSort(self.plan.root, cols))
+  }
+
+  /// <#Description#>
+  /// - Parameter cols: <#cols description#>
+  /// - Returns: <#description#>
+  public func orderBy(_ cols: String...) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getSort(self.plan.root, cols))
+  }
+
+  /// Limits the result count to the number specified.
+  /// - Parameter n: Number of records to return. Will return this number of 
records or all records if the ``DataFrame`` contains less than this number of 
records.
+  /// - Returns: A subset of the records
+  public func limit(_ n: Int32) -> DataFrame {
+return DataFrame(spark: self.spark, plan: 
SparkConnectClient.getLimit(self.plan.root, n))
+  }
+
+  /// Chec if the ``DataFrame`` is empty and returns a boolean value.

Review Comment:
   ```suggestion
 /// Checks if the ``DataFrame`` is empty and returns a boolean value.
   ```






Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


viirya commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994368629


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -68,6 +68,76 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+  @Test
+  func selectNone() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let emptySchema = try await spark.range(1).select().schema()
+#expect(emptySchema == #"{"struct":{}}"#)
+await spark.stop()
+  }
+
+  @Test
+  func select() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.range(1).select("id").schema()
+#expect(
+  schema
+== #"{"struct":{"fields":[{"name":"id","dataType":{"long":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectMultipleColumns() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.sql("SELECT * FROM VALUES (1, 
2)").select("col2", "col1").schema()
+#expect(
+  schema
+== 
#"{"struct":{"fields":[{"name":"col2","dataType":{"integer":{}}},{"name":"col1","dataType":{"integer":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectInvalidColumn() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+try await #require(throws: Error.self) {
+  let _ = try await spark.range(1).select("invalid").schema()
+}
+await spark.stop()
+  }
+
+  @Test
+  func limit() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).limit(0).count() == 0)
+#expect(try await spark.range(10).limit(1).count() == 1)
+#expect(try await spark.range(10).limit(2).count() == 2)
+await spark.stop()
+  }
+
+  @Test
+  func isEmpty() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(0).isEmpty())
+#expect(!(try await spark.range(1).isEmpty()))
+await spark.stop()
+  }
+
+  @Test
+  func sort() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).sort("id").count() == 10)

Review Comment:
   Do we need to check if the return data is actually sorted?






Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


viirya commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994367984


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -68,6 +68,76 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+  @Test
+  func selectNone() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let emptySchema = try await spark.range(1).select().schema()
+#expect(emptySchema == #"{"struct":{}}"#)
+await spark.stop()
+  }
+
+  @Test
+  func select() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.range(1).select("id").schema()
+#expect(
+  schema
+== #"{"struct":{"fields":[{"name":"id","dataType":{"long":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectMultipleColumns() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.sql("SELECT * FROM VALUES (1, 
2)").select("col2", "col1").schema()
+#expect(
+  schema
+== 
#"{"struct":{"fields":[{"name":"col2","dataType":{"integer":{}}},{"name":"col1","dataType":{"integer":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectInvalidColumn() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+try await #require(throws: Error.self) {
+  let _ = try await spark.range(1).select("invalid").schema()
+}
+await spark.stop()
+  }
+
+  @Test
+  func limit() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).limit(0).count() == 0)
+#expect(try await spark.range(10).limit(1).count() == 1)
+#expect(try await spark.range(10).limit(2).count() == 2)

Review Comment:
   ```suggestion
   #expect(try await spark.range(10).limit(2).count() == 2)
   #expect(try await spark.range(10).limit(15).count() == 10)
   ```



##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -68,6 +68,76 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+  @Test
+  func selectNone() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let emptySchema = try await spark.range(1).select().schema()
+#expect(emptySchema == #"{"struct":{}}"#)
+await spark.stop()
+  }
+
+  @Test
+  func select() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.range(1).select("id").schema()
+#expect(
+  schema
+== #"{"struct":{"fields":[{"name":"id","dataType":{"long":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectMultipleColumns() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.sql("SELECT * FROM VALUES (1, 
2)").select("col2", "col1").schema()
+#expect(
+  schema
+== 
#"{"struct":{"fields":[{"name":"col2","dataType":{"integer":{}}},{"name":"col1","dataType":{"integer":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectInvalidColumn() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+try await #require(throws: Error.self) {
+  let _ = try await spark.range(1).select("invalid").schema()
+}
+await spark.stop()
+  }
+
+  @Test
+  func limit() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).limit(0).count() == 0)
+#expect(try await spark.range(10).limit(1).count() == 1)
+#expect(try await spark.range(10).limit(2).count() == 2)

Review Comment:
   Maybe add a case exceeding the limit number?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


viirya commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994370384


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -68,6 +68,76 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+  @Test
+  func selectNone() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let emptySchema = try await spark.range(1).select().schema()
+#expect(emptySchema == #"{"struct":{}}"#)
+await spark.stop()
+  }
+
+  @Test
+  func select() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.range(1).select("id").schema()
+#expect(
+  schema
+== #"{"struct":{"fields":[{"name":"id","dataType":{"long":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectMultipleColumns() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.sql("SELECT * FROM VALUES (1, 
2)").select("col2", "col1").schema()
+#expect(
+  schema
+== 
#"{"struct":{"fields":[{"name":"col2","dataType":{"integer":{}}},{"name":"col1","dataType":{"integer":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectInvalidColumn() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+try await #require(throws: Error.self) {
+  let _ = try await spark.range(1).select("invalid").schema()
+}
+await spark.stop()
+  }
+
+  @Test
+  func limit() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).limit(0).count() == 0)
+#expect(try await spark.range(10).limit(1).count() == 1)
+#expect(try await spark.range(10).limit(2).count() == 2)
+await spark.stop()
+  }
+
+  @Test
+  func isEmpty() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(0).isEmpty())
+#expect(!(try await spark.range(1).isEmpty()))
+await spark.stop()
+  }
+
+  @Test
+  func sort() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).sort("id").count() == 10)

Review Comment:
   Okay, I see. Maybe we can return to this after `collect` is implemented.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#discussion_r1994369781


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -68,6 +68,76 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+  @Test
+  func selectNone() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let emptySchema = try await spark.range(1).select().schema()
+#expect(emptySchema == #"{"struct":{}}"#)
+await spark.stop()
+  }
+
+  @Test
+  func select() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.range(1).select("id").schema()
+#expect(
+  schema
+== #"{"struct":{"fields":[{"name":"id","dataType":{"long":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectMultipleColumns() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+let schema = try await spark.sql("SELECT * FROM VALUES (1, 
2)").select("col2", "col1").schema()
+#expect(
+  schema
+== 
#"{"struct":{"fields":[{"name":"col2","dataType":{"integer":{}}},{"name":"col1","dataType":{"integer":{}}}]}}"#
+)
+await spark.stop()
+  }
+
+  @Test
+  func selectInvalidColumn() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+try await #require(throws: Error.self) {
+  let _ = try await spark.range(1).select("invalid").schema()
+}
+await spark.stop()
+  }
+
+  @Test
+  func limit() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).limit(0).count() == 0)
+#expect(try await spark.range(10).limit(1).count() == 1)
+#expect(try await spark.range(10).limit(2).count() == 2)
+await spark.stop()
+  }
+
+  @Test
+  func isEmpty() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(0).isEmpty())
+#expect(!(try await spark.range(1).isEmpty()))
+await spark.stop()
+  }
+
+  @Test
+  func sort() async throws {
+let spark = try await SparkSession.builder.getOrCreate()
+#expect(try await spark.range(10).sort("id").count() == 10)

Review Comment:
   So, I only manually verified the result so far.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#issuecomment-2722813101

   Merged to main.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


jingz-db commented on PR #50272:
URL: https://github.com/apache/spark/pull/50272#issuecomment-2722832096

   There seem to be more places that could remove the unnecessary `close()` 
impl in the test file, e.g.:
https://github.com/apache/spark/blob/ccfc0a9dcea8fa9d6aab4dbb233f8135a3947441/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py#L1479


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51501][SQL] Disable ObjectHashAggregate for group by on collated columns [spark]

2025-03-13 Thread via GitHub


stefankandic opened a new pull request, #50267:
URL: https://github.com/apache/spark/pull/50267

   ### What changes were proposed in this pull request?
   Disabling `ObjectHashAggregate` when grouping on columns with collations.
   
   
   ### Why are the changes needed?
   https://github.com/apache/spark/pull/45290 added support for sort based 
aggregation on collated columns and explicitly forbade the use of hash 
aggregate for collated columns. However, it did not consider the third type of 
aggregate, the object hash aggregate, which is only used when there are also 
TypedImperativeAggregate expressions present 
([source](https://github.com/apache/spark/blob/f3b081066393e1568c364b6d3bc0bceabd1e7e9f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L1204)).
   
   That means that if we group by a collated column and also have a 
TypedImperativeAggregate, we will end up using the object hash aggregate, which 
can lead to incorrect results like in the example below:
   
   ```sql
   CREATE TABLE tbl(c1 STRING COLLATE UTF8_LCASE, c2 INT) USING PARQUET;
   INSERT INTO tbl VALUES ('HELLO', 1), ('hello', 2), ('HeLlO', 3);
   SELECT COLLECT_LIST(c2) as list FROM tbl GROUP BY c1;
   ```
   where the result would have three rows with values [1], [2] and [3] instead 
of one row with value [1, 2, 3].
   
   For this reason we should do the same thing as we did for the regular hash 
aggregate: make it so that the object hash aggregate does not support grouping 
expressions on collated columns.
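   
   For illustration, here is a self-contained sketch (not Spark code) of why a plain 
hash of the grouping key breaks down under a case-insensitive collation:
   ```scala
   object CollationHashDemo extends App {
     // Values that compare equal under a lowercase (UTF8_LCASE-style) collation...
     val keys = Seq("HELLO", "hello", "HeLlO")
     val lowerCaseOrdering = Ordering.by[String, String](_.toLowerCase)
     assert(keys.sliding(2).forall { case Seq(a, b) => lowerCaseOrdering.equiv(a, b) })

     // ...still have different raw hash codes, so an aggregate that hashes the binary
     // value sees three distinct groups instead of one, as in the COLLECT_LIST example.
     println(keys.map(_.hashCode).distinct.size)  // prints 3
   }
   ```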
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   New unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51502][SQL] Move collations test to collations package [spark]

2025-03-13 Thread via GitHub


stefankandic opened a new pull request, #50268:
URL: https://github.com/apache/spark/pull/50268

   
   
   ### What changes were proposed in this pull request?
   Move the collations test into the collations package, where most collation test 
suites are already located.
   
   ### Why are the changes needed?
   For consistency, and to avoid bloating the sql folder with too many suites.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing tests should suffice.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-37019][SQL] Add codegen support to array higher-order functions [spark]

2025-03-13 Thread via GitHub


Kimahriman commented on code in PR #34558:
URL: https://github.com/apache/spark/pull/34558#discussion_r1993969482


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala:
##
@@ -235,6 +256,53 @@ trait HigherOrderFunction extends Expression with 
ExpectsInputTypes {
 val canonicalizedChildren = cleaned.children.map(_.canonicalized)
 withNewChildren(canonicalizedChildren)
   }
+
+
+  protected def assignAtomic(atomicRef: String, value: String, isNull: String 
= FalseLiteral,
+  nullable: Boolean = false) = {
+if (nullable) {
+  s"""
+if ($isNull) {
+  $atomicRef.set(null);
+} else {
+  $atomicRef.set($value);
+}
+  """
+} else {
+  s"$atomicRef.set($value);"
+}
+  }
+
+  protected def assignArrayElement(ctx: CodegenContext, arrayName: String, 
elementCode: ExprCode,
+  elementVar: NamedLambdaVariable, index: String): String = {
+val elementType = elementVar.dataType
+val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value)
+val extractElement = CodeGenerator.getValue(arrayName, elementType, index)

Review Comment:
   Do you just mean the built-in concurrency of `AtomicReference` isn't needed 
except for something related to the `transform` operations, and otherwise 
you could just use a regular variable to store the result? 
   
   Agree that might be preferable, but probably out of scope for this initial 
codegen implementation since that would change how the interpreted path works 
as well
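   
   For illustration, a toy contrast of the two storage strategies being discussed here 
(purely illustrative, not the generated code):
   ```scala
   import java.util.concurrent.atomic.AtomicReference

   // What the lambda variable uses today (and what assignAtomic() in the diff emits):
   val atomicSlot = new AtomicReference[Any]()
   atomicSlot.set(42L)
   println(atomicSlot.get())  // 42

   // A plain, non-thread-safe slot, which would suffice if no concurrent access occurs:
   var plainSlot: Any = null
   plainSlot = 42L
   println(plainSlot)         // 42
   ```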
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51270] Support nanosecond precision timestamp in Variant [spark]

2025-03-13 Thread via GitHub


cashmand opened a new pull request, #50270:
URL: https://github.com/apache/spark/pull/50270

   
   
   ### What changes were proposed in this pull request?
   
   Adds Variant support for the nanosecond precision timestamp types in 
https://github.com/apache/parquet-format/commit/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2,
 similar to the UUID support added in 
https://github.com/apache/spark/pull/50025.
   
   Cast support: the PR allows casting to any type that the corresponding 
microsecond value supports. For strings, we retain the nanosecond precision, 
and for all other casts, we truncate to a microsecond timestamp and then cast. 
An alternative is to round to the nearest microsecond, but truncation is 
consistent with our current timestamp-to-integer cast, which truncates to the 
nearest second. We could also consider disabling some of these casts, but I 
didn't see a strong argument to.
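   
   As a small worked example of the truncation choice (the helper below is 
illustrative, not the actual Variant reader; whether pre-epoch values should floor 
or truncate toward zero is an assumption here):
   ```scala
   // Drop sub-microsecond precision before reusing the existing microsecond cast logic.
   def nanosToMicros(nanos: Long): Long = Math.floorDiv(nanos, 1000L)

   println(nanosToMicros(1234567891L))  // 1234567 -- the 891 ns remainder is dropped
   ```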
   
   SchemaOfVariant: The types are reported as `timestamp_nanos` and 
`timestamp_nanos_ntz`. If Spark eventually adds native support for these types, 
we would either need to use these names, or change the behaviour of 
SchemaOfVariant.
   
   ### Why are the changes needed?
   
   Support Variant values written by other tools.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, operations on Variant binaries with these types would have previously 
failed, and will now succeed.
   
   ### How was this patch tested?
   
   Unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51414][SQL] Add the make_time() function [spark]

2025-03-13 Thread via GitHub


robreeves opened a new pull request, #50269:
URL: https://github.com/apache/spark/pull/50269

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51441][SQL] Add DSv2 APIs for constraints [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50253:
URL: https://github.com/apache/spark/pull/50253#discussion_r1993739512


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/constraints/ForeignKey.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog.constraints;
+
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.expressions.NamedReference;
+
+import java.util.Arrays;
+import java.util.Objects;

Review Comment:
   ditto. `import` ordering.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-37019][SQL] Add codegen support to array higher-order functions [spark]

2025-03-13 Thread via GitHub


Kimahriman commented on PR #34558:
URL: https://github.com/apache/spark/pull/34558#issuecomment-2722033268

   > > @Kimahriman just out of curiosity, how much did the performance improve?
   > 
   > I just wanted to add to the above response that I've implemented a 
compilation scheme 
[here](https://sparkutils.github.io/quality/latest/advanced/userFunctions/#controlling-compilation-tweaking-the-quality-optimisations),
 as part of Quality, and we saw perf boosts of up to 40%, after that adding 
further lambdas triggered the cost of code generation being higher than the 
saving. It's definitely usage dependant though, the more work done in the 
function the higher the cost (and therefore potential saving by compilation), a 
small boost is noticeable on removal of the atomic under similar ideal 
circumstances.
   > 
   > edit - [the 
source](https://github.com/sparkutils/quality/blob/main/src/main/scala/org/apache/spark/sql/qualityFunctions/LambdaCompilation.scala)
   
   Not sure how I missed this comment. We haven't done extensive performance 
comparisons with and without this; we've just been using it for a few years now. 
It's hard to quantify the impact since it's completely dependent on the expressions 
run inside the functions. But that's also the whole point: by enabling codegen for 
HOFs you enable codegen for the expressions inside the lambda functions, which is 
where the expected gains come from.
   
   Additionally, this enables a follow-on I'm currently working on: subexpression 
elimination inside lambda functions. We've recently identified the lack of it as a 
major performance killer for us, since it's very easy to generate a lot of duplicate 
expression evaluations in certain cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-37019][SQL] Add codegen support to array higher-order functions [spark]

2025-03-13 Thread via GitHub


chris-twiner commented on PR #34558:
URL: https://github.com/apache/spark/pull/34558#issuecomment-2722268943

   > > > @Kimahriman just out of curiosity, how much did the performance 
improve?
   > > 
   > > 
   > > I just wanted to add to the above response that I've implemented a 
compilation scheme 
[here](https://sparkutils.github.io/quality/latest/advanced/userFunctions/#controlling-compilation-tweaking-the-quality-optimisations),
 as part of Quality, and we saw perf boosts of up to 40%, after that adding 
further lambdas triggered the cost of code generation being higher than the 
saving. It's definitely usage dependant though, the more work done in the 
function the higher the cost (and therefore potential saving by compilation), a 
small boost is noticeable on removal of the atomic under similar ideal 
circumstances.
   > > edit - [the 
source](https://github.com/sparkutils/quality/blob/main/src/main/scala/org/apache/spark/sql/qualityFunctions/LambdaCompilation.scala)
   > 
   > Not sure how I missed this comment. we haven't done extensive performance 
comparisons with and without this, we've just been using it for a few years 
now. It's hard to quantify the impact since it's completely dependent on the 
expressions run inside the functions. But that's also the whole point, by 
enabling codegen for HOFs you enable codegen for expressions inside the lambda 
functions, which are assumed to be more performant since that's the whole point 
of codegen.
   > 
   > Additionally this enables a follow on I'm currently working on which is 
enabling subexpression elimination inside of lambda functions, which we've 
recently identified as a major performance killer for us, as it's very easy to 
generate a lot of duplicate expression evaluations in certain cases
   
   The subexpression elimination option is huge!  Very exciting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-49479][CORE] Cancel the Timer non-daemon thread on stopping the BarrierCoordinator [spark]

2025-03-13 Thread via GitHub


tedyu commented on code in PR #50020:
URL: https://github.com/apache/spark/pull/50020#discussion_r1993452621


##
core/src/main/scala/org/apache/spark/BarrierCoordinator.scala:
##
@@ -134,11 +138,8 @@ private[spark] class BarrierCoordinator(
 
 // Cancel the current active TimerTask and release resources.
 private def cancelTimerTask(): Unit = {
-  if (timerTask != null) {
-timerTask.cancel()
-timer.purge()
-timerTask = null
-  }
+  timerFutures.asScala.foreach(_.cancel(false))

Review Comment:
   Thanks for your comment.
   I think merging the removal of futures with cancellation would make the code 
amenable to future code changes.
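   
   For illustration, a minimal sketch of what merging cancellation and removal could 
look like (the helper name and the `java.util.List` assumption are mine, not from 
the PR):
   ```scala
   import java.util.concurrent.ScheduledFuture
   import scala.jdk.CollectionConverters._

   def cancelAndRemoveTimerTasks(futures: java.util.List[ScheduledFuture[_]]): Unit = {
     futures.asScala.foreach(_.cancel(false))  // do not interrupt tasks that are mid-run
     futures.clear()                           // drop cancelled futures in the same step
   }
   ```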



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51504] Support `DataFrame.select` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun opened a new pull request, #16:
URL: https://github.com/apache/spark-connect-swift/pull/16

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51489][SQL] Represent SQL Script in Spark UI [spark]

2025-03-13 Thread via GitHub


dusantism-db closed pull request #50256: [SPARK-51489][SQL] Represent SQL 
Script in Spark UI
URL: https://github.com/apache/spark/pull/50256


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-37019][SQL] Add codegen support to array higher-order functions [spark]

2025-03-13 Thread via GitHub


chris-twiner commented on code in PR #34558:
URL: https://github.com/apache/spark/pull/34558#discussion_r1994100057


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala:
##
@@ -235,6 +256,53 @@ trait HigherOrderFunction extends Expression with 
ExpectsInputTypes {
 val canonicalizedChildren = cleaned.children.map(_.canonicalized)
 withNewChildren(canonicalizedChildren)
   }
+
+
+  protected def assignAtomic(atomicRef: String, value: String, isNull: String 
= FalseLiteral,
+  nullable: Boolean = false) = {
+if (nullable) {
+  s"""
+if ($isNull) {
+  $atomicRef.set(null);
+} else {
+  $atomicRef.set($value);
+}
+  """
+} else {
+  s"$atomicRef.set($value);"
+}
+  }
+
+  protected def assignArrayElement(ctx: CodegenContext, arrayName: String, 
elementCode: ExprCode,
+  elementVar: NamedLambdaVariable, index: String): String = {
+val elementType = elementVar.dataType
+val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value)
+val extractElement = CodeGenerator.getValue(arrayName, elementType, index)

Review Comment:
   Yes, that's definitely all it is. I may spend time finding a reproducible case of 
this, but it's not in the HOF test suite.
   
   It makes sense to keep the two things separate. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51505][SQL] Log empty partition number metrics in AQE coalesce [spark]

2025-03-13 Thread via GitHub


liuzqt commented on PR #50273:
URL: https://github.com/apache/spark/pull/50273#issuecomment-2722992293

   @maryannxue @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-xxxx][Collation] Guard collation regex expressions behind a flag [spark]

2025-03-13 Thread via GitHub


github-actions[bot] closed pull request #49026: [WIP][SPARK-][Collation] 
Guard collation regex expressions behind a flag
URL: https://github.com/apache/spark/pull/49026


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50404][PYTHON] PySpark DataFrame Pipe Method [spark]

2025-03-13 Thread via GitHub


github-actions[bot] commented on PR #48947:
URL: https://github.com/apache/spark/pull/48947#issuecomment-2722995058

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50424][INFRA] Extract the common content of `Dockerfile` from `Docs`, `Linter`, and `SparkR` test images [spark]

2025-03-13 Thread via GitHub


github-actions[bot] closed pull request #48967: [SPARK-50424][INFRA] Extract 
the common content of `Dockerfile` from `Docs`, `Linter`, and `SparkR` test 
images
URL: https://github.com/apache/spark/pull/48967


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-XXXX][Avro] Fix avro deserialization breaking for UnionType[null, Record] [spark]

2025-03-13 Thread via GitHub


github-actions[bot] closed pull request #49019: [SPARK-][Avro] Fix avro 
deserialization breaking for UnionType[null, Record]
URL: https://github.com/apache/spark/pull/49019


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


HeartSaVioR commented on PR #50272:
URL: https://github.com/apache/spark/pull/50272#issuecomment-2723004044

   It's better to retain both types of implementations: since we want close() to be 
optional, we should cover both cases in testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51505][SQL] Log empty partition number metrics in AQE coalesce [spark]

2025-03-13 Thread via GitHub


liuzqt opened a new pull request, #50273:
URL: https://github.com/apache/spark/pull/50273

   ### What changes were proposed in this pull request?
   
   Log empty partition number metrics in AQE coalesce
   
   
   ### Why are the changes needed?
   
   There are cases where the shuffle is highly skewed and many partitions are empty 
(probably due to a small NDV), while only a few partitions contain data and are very 
large. AQE coalesce here basically does nothing except eliminate the empty 
partitions, so the AQE coalesce metrics might look confusing and users might think 
it wrongly coalesced into large partitions.
   
   We'd better log the number of empty partitions in the metrics, as sketched below.
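   
   As a minimal, self-contained sketch of the metric (illustrative only, not the 
actual AQEShuffleReadExec change):
   ```scala
   // Given the byte size reported for every shuffle partition, the new metric is
   // simply the count of partitions that carry no data.
   def numEmptyPartitions(partitionSizes: Array[Long]): Int =
     partitionSizes.count(_ == 0L)

   // Example: a skewed shuffle where most partitions are empty.
   val sizes = Array(0L, 0L, 0L, 512L * 1024 * 1024, 0L, 256L * 1024 * 1024)
   println(numEmptyPartitions(sizes))  // 4
   ```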
   
   
   ### Does this PR introduce _any_ user-facing change?
   Added a `numEmptyPartitions` metric to AQEShuffleReadExec when there is 
coalescing.
   
   
   ### How was this patch tested?
   
   New test case.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


HeartSaVioR commented on PR #50272:
URL: https://github.com/apache/spark/pull/50272#issuecomment-2723005968

   Thanks! Merging to master/4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


HeartSaVioR commented on PR #50272:
URL: https://github.com/apache/spark/pull/50272#issuecomment-2723008273

   Forgot to tell, I have got approval on merging this bugfix from @cloud-fan 
beforehand.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


HeartSaVioR closed pull request #50272: [SPARK-51506][PYTHON][SS] Do not 
enforce users to implement close() in TransformWithStateInPandas
URL: https://github.com/apache/spark/pull/50272


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


attilapiros commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994459647


##
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(
 
   /** Returns the sequence of partition ids that are missing (i.e. needs to be 
computed). */
   override def findMissingPartitions(): Seq[Int] = {
-mapOutputTrackerMaster
-  .findMissingPartitions(shuffleDep.shuffleId)
-  .getOrElse(0 until numPartitions)
+if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   I think it would be much better to unregister all the map and merge output at the 
rollback of the stages (for each rolled-back stage), as is currently done for the 
barrier RDDs:
   
https://github.com/apache/spark/blob/1014c6babf57d08ba66bba3706bd350f4e9277dd/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2116
   
   That way `areAllPartitionsMissing` and `attemptIdAllPartitionsMissing` won't 
be needed as `mapOutputTrackerMaster.findMissingPartitions` will give back all 
the partitions. 
   
   And for the `ResultStage` I wonder whether an `abortStage` would be 
sufficient.
   



##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:
##
@@ -167,7 +167,10 @@ abstract class BaseYarnClusterSuite extends SparkFunSuite 
with Matchers {
   extraJars: Seq[String] = Nil,
   extraConf: Map[String, String] = Map(),
   extraEnv: Map[String, String] = Map(),
-  outFile: Option[File] = None): SparkAppHandle.State = {
+  outFile: Option[File] = None,
+  testTimeOut: Int = 3, // minutes
+  timeOutIntervalCheck: Int = 1 // seconds

Review Comment:
   Why `Int`? What about `Duration`?



##
resource-managers/yarn/pom.xml:
##
@@ -37,6 +37,11 @@
   spark-core_${scala.binary.version}
   ${project.version}
 
+
+  org.apache.spark
+  spark-sql_${scala.binary.version}
+  ${project.version}
+

Review Comment:
   Please do not depend on `spark-sql` in `spark-yarn`, especially not in compile 
scope.
   I would prefer to have the test changed to use only RDDs.



##
scalastyle-config.xml:
##
@@ -94,7 +94,7 @@ This file is divided into 3 sections:
   
 
   
-
+

Review Comment:
   Please revert this, as this decision affects all future source code and requires 
a bigger discussion.
   
   I suggest to use:
   ```
   // scalastyle:off argcount
   // scalastyle:on
   ```
   
   Or use a case class containing both values, as they are related.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51497][SQL] Add the default time formatter [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun closed pull request #50266: [SPARK-51497][SQL] Add the default 
time formatter
URL: https://github.com/apache/spark/pull/50266


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


bogao007 commented on PR #50272:
URL: https://github.com/apache/spark/pull/50272#issuecomment-2722873069

   > nits: There seem to be more places that could remove the unnecessary 
`close()` impl in test file: e.g.
   > 
   > 
https://github.com/apache/spark/blob/ccfc0a9dcea8fa9d6aab4dbb233f8135a3947441/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py#L1479
   
   Yeah, I tend to keep both in the test. Removing some of the `close()` 
implementations should be enough to verify this. Do you think we should remove all 
of them instead?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51491][PYTHON] Simplify boxplot with subquery APIs [spark]

2025-03-13 Thread via GitHub


zhengruifeng closed pull request #50258: [SPARK-51491][PYTHON] Simplify boxplot 
with subquery APIs
URL: https://github.com/apache/spark/pull/50258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51191][SQL] Validate default values handling in DELETE, UPDATE, MERGE [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #50271:
URL: https://github.com/apache/spark/pull/50271#discussion_r1994422636


##
sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:
##
@@ -32,6 +32,58 @@ abstract class MergeIntoTableSuiteBase extends 
RowLevelOperationSuiteBase {
 
   import testImplicits._
 
+  test("merge into table containing added column with default value") {
+withTempView("source") {
+  sql(
+s"""CREATE TABLE $tableNameAsString (
+   | pk INT NOT NULL,
+   | salary INT NOT NULL DEFAULT -1,
+   | dep STRING)
+   |PARTITIONED BY (dep)
+   |""".stripMargin)
+
+  append("pk INT NOT NULL, dep STRING",
+"""{ "pk": 1, "dep": "hr" }
+  |{ "pk": 2, "dep": "hr" }
+  |{ "pk": 3, "dep": "hr" }
+  |""".stripMargin)
+
+  sql(s"ALTER TABLE $tableNameAsString ADD COLUMN txt STRING DEFAULT 
'initial-text'")
+
+  checkAnswer(
+sql(s"SELECT * FROM $tableNameAsString"),
+Seq(
+  Row(1, -1, "hr", "initial-text"),
+  Row(2, -1, "hr", "initial-text"),
+  Row(3, -1, "hr", "initial-text")))
+
+  val sourceRows = Seq(
+(1, 100, "hr"),
+(4, 400, "hr"))
+  sourceRows.toDF("pk", "salary", "dep").createOrReplaceTempView("source")
+
+  sql(
+s"""MERGE INTO $tableNameAsString t
+   |USING source s
+   |ON t.pk = s.pk
+   |WHEN MATCHED THEN
+   | UPDATE SET t.salary = s.salary, t.txt = DEFAULT

Review Comment:
   Can we remove the `t.txt = DEFAULT` part in this line? Or can we use a 
different value instead of `DEFAULT`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51191][SQL] Validate default values handling in DELETE, UPDATE, MERGE [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #50271:
URL: https://github.com/apache/spark/pull/50271#issuecomment-2722908307

   cc @huaxingao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51491][PYTHON] Simplify boxplot with subquery APIs [spark]

2025-03-13 Thread via GitHub


zhengruifeng commented on PR #50258:
URL: https://github.com/apache/spark/pull/50258#issuecomment-2722879036

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


ahshahid commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994488768


##
scalastyle-config.xml:
##
@@ -94,7 +94,7 @@ This file is divided into 3 sections:
   
 
   
-
+

Review Comment:
   Ok. will do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


ahshahid commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994488494


##
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(
 
   /** Returns the sequence of partition ids that are missing (i.e. needs to be 
computed). */
   override def findMissingPartitions(): Seq[Int] = {
-mapOutputTrackerMaster
-  .findMissingPartitions(shuffleDep.shuffleId)
-  .getOrElse(0 until numPartitions)
+if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   For the map stage, I think what you are saying sounds great. I don't know much 
about the barrier RDD logic, but unregistering the map outputs makes sense.
   For the ResultStage, I am not sure what you mean by abort stage. That will throw 
an exception, right? But that is not what we should do if the FetchFailure is 
happening for the first partition of the ResultStage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51191][SQL] Validate default values handling in DELETE, UPDATE, MERGE [spark]

2025-03-13 Thread via GitHub


aokolnychyi opened a new pull request, #50271:
URL: https://github.com/apache/spark/pull/50271

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds tests for default values handling in DELETE, UPDATE, MERGE.
   
   ### Why are the changes needed?
   
   
   These tests are needed to ensure default values are handled correctly in all 
DML operations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   
   ### How was this patch tested?
   
   
   This PR consists of tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


ahshahid commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994537557


##
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(
 
   /** Returns the sequence of partition ids that are missing (i.e. needs to be 
computed). */
   override def findMissingPartitions(): Seq[Int] = {
-mapOutputTrackerMaster
-  .findMissingPartitions(shuffleDep.shuffleId)
-  .getOrElse(0 until numPartitions)
+if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   Are you sure that would work even for the ShuffleMapStage?
   Because even if we unregister, what happens if a successful partition completion 
task arrives before the attempt ID is increased? Will it get ignored because of the 
unregistering?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51372][SQL] Introduce a builder pattern in TableCatalog [spark]

2025-03-13 Thread via GitHub


beliefer commented on code in PR #50137:
URL: https://github.com/apache/spark/pull/50137#discussion_r1994546972


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableBuilderImpl.java:
##
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.catalog;
+
+import com.google.common.collect.Maps;
+import org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException;
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
+import org.apache.spark.sql.connector.expressions.Transform;
+
+import java.util.Map;
+
+/**
+ * Default implementation of {@link TableCatalog.TableBuilder}.
+ */
+public class TableBuilderImpl implements TableCatalog.TableBuilder {
+  /** Catalog where table needs to be created. */
+  private final TableCatalog catalog;
+  /** Table identifier. */
+  private final Identifier identifier;
+  /** Columns of the new table. */
+  private final Column[] columns;
+  /** Table properties. */
+  private final Map properties = Maps.newHashMap();
+  /** Transforms to use for partitioning data in the table. */
+  private Transform[] partitions = new Transform[0];
+
+  /**
+   * Constructor for TableBuilderImpl.
+   *
+   * @param catalog catalog where table needs to be created.
+   * @param identifier identifier for the table.
+   * @param columns the columns of the new table.
+   */
+  public TableBuilderImpl(TableCatalog catalog,
+  Identifier identifier,
+  Column[] columns) {
+this.catalog = catalog;
+this.identifier = identifier;
+this.columns = columns;
+  }
+
+  @Override
+  public TableCatalog.TableBuilder withPartitions(Transform[] partitions) {
+this.partitions = partitions;
+return this;
+  }
+
+  @Override
+  public TableCatalog.TableBuilder withProperties(Map 
properties) {
+this.properties.putAll(properties);
+return this;
+  }
+
+  @Override
+  public Table create() throws TableAlreadyExistsException, 
NoSuchNamespaceException {
+return catalog.createTable(identifier, columns, partitions, properties);

Review Comment:
   I have a question: how can we keep this flexible if users want to customize the 
`TableBuilder` and override `create`?
   It seems there is no flexible way to do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50806][SQL] Support InputRDDCodegen interruption on task cancellation [spark]

2025-03-13 Thread via GitHub


Ngone51 commented on PR #49501:
URL: https://github.com/apache/spark/pull/49501#issuecomment-2723111433

   @cloud-fan @dongjoon-hyun I have updated the PR. Could you take another 
look? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51507] Support creating config map that can be mounted by Spark pods for apps [spark-kubernetes-operator]

2025-03-13 Thread via GitHub


jiangzho opened a new pull request, #166:
URL: https://github.com/apache/spark-kubernetes-operator/pull/166

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR introduces a new field, `configMapSpecs`, which enables users to create 
ConfigMap(s) that can later be mounted on the driver and/or executors, with values 
specified in the SparkApplication. It also adds corresponding docs and unit tests.
   
   ### Why are the changes needed?
   
   
   There is common user demand for creating lightweight value/file overrides 
via ConfigMaps. This feature saves users the trouble of creating ConfigMap(s) 
beforehand: the ConfigMap shares the lifecycle of the corresponding Spark app.
   
   Users may mount the ConfigMap as needed in the driver and executor pod 
templates; one example is provided in the doc. In the future we may further 
enhance this flow by mounting the created ConfigMap automatically on Spark 
pods, either by supporting a ConfigMap volume type in Spark on Kubernetes or 
by adding a mutating webhook that alters pod templates after creation. For 
now, this PR opens up the possibility of creating a ConfigMap and mounting it 
with one single spec.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No - not yet released
   
   ### How was this patch tested?
   
   
   Pass the CIs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun opened a new pull request, #17:
URL: https://github.com/apache/spark-connect-swift/pull/17

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #16:
URL: 
https://github.com/apache/spark-connect-swift/pull/16#issuecomment-2722806877

   Thank you so much, @viirya !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #17:
URL: 
https://github.com/apache/spark-connect-swift/pull/17#discussion_r1994560977


##
Sources/SparkConnect/DataFrame.swift:
##
@@ -58,7 +58,7 @@ public actor DataFrame: Sendable {
 
   /// Add `Apache Arrow`'s `RecordBatch`s to the internal array.
   /// - Parameter batches: An array of ``RecordBatch``.
-  private func addBathes(_ batches: [RecordBatch]) {
+  private func addBatches(_ batches: [RecordBatch]) {

Review Comment:
   This is a typo fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #17:
URL: 
https://github.com/apache/spark-connect-swift/pull/17#discussion_r1994561915


##
Sources/SparkConnect/SparkSession.swift:
##
@@ -45,12 +45,10 @@ public actor SparkSession {
   ///   - userID: an optional user ID. If absent, `SPARK_USER` environment or 
``ProcessInfo.processInfo.userName`` is used.
   init(_ connection: String, _ userID: String? = nil) {
 let processInfo = ProcessInfo.processInfo
-#if os(iOS) || os(watchOS) || os(tvOS)
-let userName = processInfo.environment["SPARK_USER"] ?? ""
-#elseif os(macOS) || os(Linux)
+#if os(macOS) || os(Linux)

Review Comment:
   I simplified the implementation in a new way to cover `os(visionOS)` too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #17:
URL: 
https://github.com/apache/spark-connect-swift/pull/17#discussion_r1994561481


##
Sources/SparkConnect/SparkConnectClient.swift:
##
@@ -275,9 +275,11 @@ public actor SparkConnectClient {
 let expressions: [Spark_Connect_Expression.SortOrder] = cols.map {
   var expression = Spark_Connect_Expression.SortOrder()
   expression.child.exprType = 
.unresolvedAttribute($0.toUnresolvedAttribute)
+  expression.direction = .ascending
   return expression
 }
 sort.order = expressions
+sort.isGlobal = true

Review Comment:
   I piggy-backed this fix while checking test coverage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #17:
URL: 
https://github.com/apache/spark-connect-swift/pull/17#discussion_r1994569044


##
Tests/SparkConnectTests/DataFrameTests.swift:
##
@@ -125,19 +125,25 @@ struct DataFrameTests {
 await spark.stop()
   }
 
+#if !os(Linux)

Review Comment:
   - Like `show()`, `collect()` currently has a binary compatibility issue on 
`os(Linux)`.
   - On macOS, all tests pass.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51443][SS] Fix singleVariantColumn in DSv2 and readStream. [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on PR #50217:
URL: https://github.com/apache/spark/pull/50217#issuecomment-2721276493

   thanks, merging to master/4.0!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51487][PYTHON][INFRA] Refresh testing images for pyarrow 19 [spark]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #50255:
URL: https://github.com/apache/spark/pull/50255#issuecomment-2722887996

   Given the assessment, I'm okay to merge this. Feel free to merge, 
@zhengruifeng .
   
   cc @cloud-fan as the release manager of Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


attilapiros commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994377379


##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:
##
@@ -167,7 +167,10 @@ abstract class BaseYarnClusterSuite extends SparkFunSuite 
with Matchers {
   extraJars: Seq[String] = Nil,
   extraConf: Map[String, String] = Map(),
   extraEnv: Map[String, String] = Map(),
-  outFile: Option[File] = None): SparkAppHandle.State = {
+  outFile: Option[File] = None,
+  testTimeOut: Int = 3, // minutes
+  timeOutIntervalCheck: Int = 1 // seconds

Review Comment:
   Why `Int` why not `Duration`?
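
   For illustration, a minimal sketch (a hypothetical helper, not the code in 
this diff) of what `Duration`-typed parameters buy over `Int` plus a unit 
comment:

```scala
import scala.concurrent.duration._

// Hypothetical polling helper: with Duration-typed parameters the unit is part
// of the signature, so call sites read 3.minutes / 1.second instead of bare Ints.
object PollingExample {
  def waitUntil(
      condition: () => Boolean,
      timeout: FiniteDuration = 3.minutes,
      checkInterval: FiniteDuration = 1.second): Boolean = {
    val deadline = timeout.fromNow
    while (deadline.hasTimeLeft()) {
      if (condition()) return true
      Thread.sleep(checkInterval.toMillis)
    }
    false
  }
}

// Example call site: PollingExample.waitUntil(() => appIsDone, timeout = 5.minutes)
```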



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-51506][PYTHON][SS] Do not enforce users to implement close() in TransformWithStateInPandas [spark]

2025-03-13 Thread via GitHub


bogao007 opened a new pull request, #50272:
URL: https://github.com/apache/spark/pull/50272

   
   
   ### What changes were proposed in this pull request?
   
   Do not force users to implement `close()` in TransformWithStateInPandas, 
since `close()` is an optional function.
   
   ### Why are the changes needed?
   
   An optional function should not force users to implement it. This also 
aligns with the Scala TWS behavior.
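
   For reference, a minimal Scala sketch of the pattern (illustrative only, 
not the actual Spark `StatefulProcessor` API): giving `close()` a default 
no-op body makes it optional for implementors.

```scala
// Illustrative pattern only (not the real Spark API): close() has a default
// no-op body, so only processors that hold resources need to override it.
abstract class Processor {
  def init(): Unit                      // required
  def handleInput(row: String): Unit    // required
  def close(): Unit = {}                // optional
}

// A processor that never overrides close() still compiles and works:
class CountingProcessor extends Processor {
  private var count = 0L
  override def init(): Unit = { count = 0L }
  override def handleInput(row: String): Unit = { count += 1 }
}
```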
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   Updated the existing unit tests by not implementing the `close()` function 
in several stateful processors.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


ahshahid commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994490858


##
resource-managers/yarn/pom.xml:
##
@@ -37,6 +37,11 @@
   spark-core_${scala.binary.version}
   ${project.version}
 
+
+  org.apache.spark
+  spark-sql_${scala.binary.version}
+  ${project.version}
+

Review Comment:
   For now I will add it with test scope. Then let me try to reproduce it 
using just RDDs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51272][CORE]. Fix for the race condition in Scheduler causing failure in retrying all partitions in case of indeterministic shuffle keys [spark]

2025-03-13 Thread via GitHub


ahshahid commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r1994509791


##
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:
##
@@ -167,7 +167,10 @@ abstract class BaseYarnClusterSuite extends SparkFunSuite 
with Matchers {
   extraJars: Seq[String] = Nil,
   extraConf: Map[String, String] = Map(),
   extraEnv: Map[String, String] = Map(),
-  outFile: Option[File] = None): SparkAppHandle.State = {
+  outFile: Option[File] = None,
+  testTimeOut: Int = 3, // minutes
+  timeOutIntervalCheck: Int = 1 // seconds

Review Comment:
   Did not think about it. Thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51508] Support `collect(): [[String]]` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun commented on PR #17:
URL: 
https://github.com/apache/spark-connect-swift/pull/17#issuecomment-2723173515

   Could you review this, @viirya? I added the first implementation of the 
`collect()` API, both for users and as an easy testing method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51504] Support `select/limit/sort/orderBy/isEmpty` for `DataFrame` [spark-connect-swift]

2025-03-13 Thread via GitHub


dongjoon-hyun closed pull request #16: [SPARK-51504] Support 
`select/limit/sort/orderBy/isEmpty` for `DataFrame`
URL: https://github.com/apache/spark-connect-swift/pull/16


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50763][SQL] Add Analyzer rule for resolving SQL table functions [spark]

2025-03-13 Thread via GitHub


cloud-fan commented on PR #49471:
URL: https://github.com/apache/spark/pull/49471#issuecomment-2723198248

   thanks, merging to master/4.0! (This is the last piece of the SQL UDF 
feature)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50763][SQL] Add Analyzer rule for resolving SQL table functions [spark]

2025-03-13 Thread via GitHub


cloud-fan closed pull request #49471: [SPARK-50763][SQL] Add Analyzer rule for 
resolving SQL table functions
URL: https://github.com/apache/spark/pull/49471


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation [spark]

2025-03-13 Thread via GitHub


pan3793 commented on PR #50232:
URL: https://github.com/apache/spark/pull/50232#issuecomment-2720217305

   > Hey, why are any manual tests listed in the PR desc not included in UTs?
   
   Given that we have excluded `hive-llap-*` deps from STS modules, the 
existing STS SQL tests should cover all my manual test cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51450][CORE] BarrierCoordinator thread not exiting in Spark standalone mode [spark]

2025-03-13 Thread via GitHub


beliefer commented on PR #50223:
URL: https://github.com/apache/spark/pull/50223#issuecomment-2720399615

   At that time, https://issues.apache.org/jira/browse/SPARK-46895 was an 
improvement, not a bug fix.
   You mean the JVM will not exit even if branch-3.5 uses `Timer`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51450][CORE] BarrierCoordinator thread not exiting in Spark standalone mode [spark]

2025-03-13 Thread via GitHub


beliefer commented on PR #50223:
URL: https://github.com/apache/spark/pull/50223#issuecomment-2720555409

   cc @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51450][CORE] BarrierCoordinator thread not exiting in Spark standalone mode [spark]

2025-03-13 Thread via GitHub


jjayadeep06 commented on PR #50223:
URL: https://github.com/apache/spark/pull/50223#issuecomment-2720538216

   > At that time, https://issues.apache.org/jira/browse/SPARK-46895 was an 
improvement, not a bug fix. You mean the JVM will not exit even if branch-3.5 
uses `Timer`?
   
   Yes, and if you look at the changes, it is fixing the existing `Timer` 
implementation.
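
   For readers following along, a self-contained sketch of the underlying JVM 
behavior (illustrative only, not Spark's `BarrierCoordinator` code): a 
`java.util.Timer` created with `isDaemon = false` runs its tasks on a 
non-daemon thread, and a live non-daemon thread keeps the JVM from exiting.

```scala
import java.util.{Timer, TimerTask}

object TimerExitDemo {
  def main(args: Array[String]): Unit = {
    // Non-daemon timer thread: the JVM stays alive after main() returns.
    val timer = new Timer("demo-timer", /* isDaemon = */ false)
    timer.schedule(new TimerTask { def run(): Unit = println("tick") }, 0L, 1000L)
    // The process keeps printing "tick" until timer.cancel() is called (or the
    // Timer is created with isDaemon = true), because the worker thread is not
    // a daemon and therefore blocks JVM shutdown.
  }
}
```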


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP][SQL] Add the default time formatter [spark]

2025-03-13 Thread via GitHub


MaxGekk opened a new pull request, #50266:
URL: https://github.com/apache/spark/pull/50266

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51402][SQL][TESTS] Test TimeType in UDF [spark]

2025-03-13 Thread via GitHub


MaxGekk closed pull request #50194: [SPARK-51402][SQL][TESTS] Test TimeType in 
UDF
URL: https://github.com/apache/spark/pull/50194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-51402][SQL][TESTS] Test TimeType in UDF [spark]

2025-03-13 Thread via GitHub


MaxGekk commented on PR #50194:
URL: https://github.com/apache/spark/pull/50194#issuecomment-2720630196

   @calilisantos Congratulations on your first contribution to Apache Spark!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


