[PR] [SPARK-51136][HISTORYSERVER] Set CallerContext for History Server [spark]

2025-02-09 Thread via GitHub


cnauroth opened a new pull request, #49858:
URL: https://github.com/apache/spark/pull/49858

   ### What changes were proposed in this pull request?
   
   Initialize the Hadoop RPC `CallerContext` during History Server startup, 
before `FileSystem` access. Calls to HDFS will get tagged in the audit log as 
originating from the History Server.
   
   ### Why are the changes needed?
   
   Other Spark processes already set the `CallerContext` so that additional 
auditing context propagates in Hadoop RPC calls. This PR provides the same 
auditing context for calls from the History Server. Other callers also attach 
information such as the app ID and attempt ID; we don't do that here, because 
the History Server serves multiple apps/attempts.
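   
   Conceptually, the startup hook amounts to the sketch below. It uses Hadoop's 
public `org.apache.hadoop.ipc.CallerContext` API directly for illustration; the 
PR itself routes this through Spark's internal `CallerContext` helper in 
`Utils.scala`.
   
   ```
   import org.apache.hadoop.ipc.CallerContext
   
   // Build a caller context and install it for the current thread, before the
   // first FileSystem access, so that the NameNode audit log records
   // callerContext=SPARK_HISTORY for subsequent HDFS RPCs.
   val historyContext = new CallerContext.Builder("SPARK_HISTORY").build()
   CallerContext.setCurrent(historyContext)
   ```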
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. In environments that configure `hadoop.caller.context.enabled=true`, 
users will now see additional information in the HDFS audit logs explicitly 
stating that calls originated from the History Server.
   
   ### How was this patch tested?
   
   A new unit test has been added. All tests pass in the history package.
   
   ```
   build/mvn -pl core test -Dtest=none -DmembersOnlySuites=org.apache.spark.deploy.history
   ```
   
   When the changes are deployed to a running cluster, the new caller context 
is visible in the HDFS audit logs.
   
   ```
   2025-02-07 23:00:54,657 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0012  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,683 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0011  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,699 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0011  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,715 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0010  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,729 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0010  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,743 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0009  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,755 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0009  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,767 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0008  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-07 23:00:54,779 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=spark (auth:SIMPLE)  ip=/10.240.5.205  cmd=open  src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0008  dst=null  perm=null  proto=rpc  callerContext=SPARK_HISTORY
   2025-02-09 no
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   



Re: [PR] [SPARK-51136][HISTORYSERVER] Set CallerContext for History Server [spark]

2025-02-09 Thread via GitHub


cnauroth commented on code in PR #49858:
URL: https://github.com/apache/spark/pull/49858#discussion_r1948244266


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -3151,9 +3152,31 @@ private[spark] object Utils
   }
 }
 
-private[util] object CallerContext extends Logging {

Review Comment:
   I needed to relax visibility on things here to facilitate unit testing with 
caller context enabled. LMK if a different approach is preferred (reflection?).
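   
   For context, Scala's qualified access modifiers scope visibility by package; 
a minimal illustration (the object names here are hypothetical, only the 
package relationships matter):
   
   ```
   package org.apache.spark.util
   
   // private[util]: visible only inside org.apache.spark.util, so a test
   // suite in org.apache.spark.deploy.history cannot reference it.
   private[util] object NarrowlyScoped
   
   // private[spark]: visible from any package under org.apache.spark,
   // including the history test suites. This is the relaxation meant above.
   private[spark] object WidelyScoped
   ```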






Re: [PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


HyukjinKwon commented on PR #49859:
URL: https://github.com/apache/spark/pull/49859#issuecomment-2646679978

   cc @xinrong-meng mind taking a look please? I think the plot tests fail with 
different dependency versions.





[PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


HyukjinKwon opened a new pull request, #49859:
URL: https://github.com/apache/spark/pull/49859

   ### What changes were proposed in this pull request?
   
   This PR proposes to skip 
pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot.
   
   ### Why are the changes needed?
   
   One failure here stops the Python build, so it can't test the others. Filed a 
JIRA to re-enable it: https://issues.apache.org/jira/browse/SPARK-51137
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Will monitor the build.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


HyukjinKwon commented on PR #49859:
URL: https://github.com/apache/spark/pull/49859#issuecomment-2646680247

   cc @zhengruifeng too





Re: [PR] [TYPING] Add type overloads for inplace dataframe operations [spark]

2025-02-09 Thread via GitHub


github-actions[bot] commented on PR #48662:
URL: https://github.com/apache/spark/pull/48662#issuecomment-2646681728

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-49730][SQL] classify syntax errors for pgsql, mysql, sqlserver and h2 [spark]

2025-02-09 Thread via GitHub


github-actions[bot] commented on PR #48368:
URL: https://github.com/apache/spark/pull/48368#issuecomment-2646681753

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-49827][SQL] Fetching all partitions from hive metastore in batches [spark]

2025-02-09 Thread via GitHub


github-actions[bot] commented on PR #48337:
URL: https://github.com/apache/spark/pull/48337#issuecomment-2646681764

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-50101][SQL] Fix collated behavior of StringToMap expression [spark]

2025-02-09 Thread via GitHub


github-actions[bot] closed pull request #48642: [SPARK-50101][SQL] Fix collated 
behavior of StringToMap expression
URL: https://github.com/apache/spark/pull/48642





Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948345527


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -255,4 +304,26 @@ case class ExpressionEquals(e: Expression) {
  * Instead of appending to a mutable list/buffer of Expressions, just update the "flattened"
  * useCount in this wrapper in-place.
  */
-case class ExpressionStats(expr: Expression)(var useCount: Int)
+case class ExpressionStats(expr: Expression)(
+    var useCount: Int = 1,
+    var conditionalUseCount: Int = 0) {
+  def getUseCount(): Int = if (useCount > 0) {
+    useCount + conditionalUseCount
+  } else {
+    0
+  }
+}
+
+/**
+ * A wrapper for the different types of children of expressions. `alwaysChildren` are child
+ * expressions that will always be evaluated and should be considered for subexpressions.
+ * `commonChildren` are children such that if there are any common expressions among them, those
+ * should be considered for subexpressions. `conditionalChildren` are children that are
+ * conditionally evaluated, such as in If, CaseWhen, or Coalesce expressions, and should only
+ * be considered for subexpressions if they are evaluated non-conditionally elsewhere.
+ */
+case class RecurseChildren(
+    alwaysChildren: Seq[Expression],
+    commonChildren: Seq[Seq[Expression]] = Nil,

Review Comment:
   do we have an example of this `commonChildren`?






Re: [PR] [SQL][SPARK-51113] Fix correctness with UNION/EXCEPT/INTERSECT inside a view or EXECUTE IMMEDIATE [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49835:
URL: https://github.com/apache/spark/pull/49835#discussion_r1948346938


##
sql/core/src/test/resources/sql-tests/inputs/view-correctness.sql:
##
@@ -0,0 +1,421 @@
+-- This test suite checks the correctness of queries over views
+
+-- SPARK-51113, UNION
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num
+  UNION
+  SELECT 2
+  UNION
+  SELECT 3
+  UNION
+  SELECT 4
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1
+  UNION
+  SELECT 2
+  UNION
+  SELECT 3
+  UNION
+  SELECT 4
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a' AS a
+  UNION
+  SELECT 'b'
+  UNION
+  SELECT 'c'
+  UNION
+  SELECT 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a'
+  UNION
+  SELECT 'b'
+  UNION
+  SELECT 'c'
+  UNION
+  SELECT 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a'
+  UNION
+  SELECT 2, 'b'
+  UNION
+  SELECT 3, 'c'
+  UNION
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a'
+  UNION
+  SELECT 2, 'b'
+  UNION
+  SELECT 3, 'c'
+  UNION
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a' AS a
+  UNION
+  SELECT 2, 'b'
+  UNION
+  SELECT 3, 'c'
+  UNION
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a' AS a
+  UNION
+  SELECT 2, 'b'
+  UNION
+  SELECT 3, 'c'
+  UNION
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+-- SPARK-51113, UNION ALL
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num
+  UNION ALL
+  SELECT 2
+  UNION ALL
+  SELECT 3
+  UNION ALL
+  SELECT 4
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1
+  UNION ALL
+  SELECT 2
+  UNION ALL
+  SELECT 3
+  UNION ALL
+  SELECT 4
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a' AS a
+  UNION ALL
+  SELECT 'b'
+  UNION ALL
+  SELECT 'c'
+  UNION ALL
+  SELECT 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a'
+  UNION ALL
+  SELECT 'b'
+  UNION ALL
+  SELECT 'c'
+  UNION ALL
+  SELECT 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a'
+  UNION ALL
+  SELECT 2, 'b'
+  UNION ALL
+  SELECT 3, 'c'
+  UNION ALL
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a'
+  UNION ALL
+  SELECT 2, 'b'
+  UNION ALL
+  SELECT 3, 'c'
+  UNION ALL
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a' AS a
+  UNION ALL
+  SELECT 2, 'b'
+  UNION ALL
+  SELECT 3, 'c'
+  UNION ALL
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a' AS a
+  UNION ALL
+  SELECT 2, 'b'
+  UNION ALL
+  SELECT 3, 'c'
+  UNION ALL
+  SELECT 4, 'd'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+-- SPARK-51113, EXCEPT
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num
+  EXCEPT
+  SELECT 2
+  EXCEPT
+  SELECT 1
+  EXCEPT
+  SELECT 2
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1
+  EXCEPT
+  SELECT 2
+  EXCEPT
+  SELECT 1
+  EXCEPT
+  SELECT 2
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a' AS a
+  EXCEPT
+  SELECT 'b'
+  EXCEPT
+  SELECT 'a'
+  EXCEPT
+  SELECT 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a'
+  EXCEPT
+  SELECT 'b'
+  EXCEPT
+  SELECT 'a'
+  EXCEPT
+  SELECT 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+  EXCEPT
+  SELECT 1, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+  EXCEPT
+  SELECT 1, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1, 'a' AS a
+  EXCEPT
+  SELECT 2, 'b'
+  EXCEPT
+  SELECT 1, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num, 'a' AS a
+  EXCEPT
+  SELECT 2, 'b'
+  EXCEPT
+  SELECT 1, 'a'
+  EXCEPT
+  SELECT 2, 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+-- SPARK-51113, INTERSECT
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1 AS num
+  INTERSECT
+  SELECT 1
+  INTERSECT
+  SELECT 2
+  INTERSECT
+  SELECT 2
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 1
+  INTERSECT
+  SELECT 1
+  INTERSECT
+  SELECT 2
+  INTERSECT
+  SELECT 2
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a' AS a
+  INTERSECT
+  SELECT 'a'
+  INTERSECT
+  SELECT 'b'
+  INTERSECT
+  SELECT 'b'
+;
+SELECT * FROM v1;
+DROP VIEW v1;
+
+CREATE TEMPORARY VIEW v1 AS
+  SELECT 'a'
+  INTERSECT
+  SELECT 'a'
+  INTERSECT
+  SELECT 'b'
+  INTERSECT
+  SELECT 'b'

Review Comment:
   For the test coverage 

Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


LuciferYang commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646829396

   Just got back from vacation, I'll take a look tonight.





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on code in PR #49854:
URL: https://github.com/apache/spark/pull/49854#discussion_r1948351826


##
pom.xml:
##
@@ -615,16 +615,17 @@
 org.jvnet.staxex
 stax-ex
   
-  
   
-jakarta.activation

Review Comment:
   Thanks for pointing this out, I updated it






Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646834860

   > Should the change be mentioned in the migration guide? Otherwise LGTM
   
   Added a change description in `ml-migration-guide.md`.





[PR] [SPARK-51139][ML][CONNECT] Refine error class `MLAttributeNotAllowedException` [spark]

2025-02-09 Thread via GitHub


zhengruifeng opened a new pull request, #49860:
URL: https://github.com/apache/spark/pull/49860

   
   
   ### What changes were proposed in this pull request?
   Refine error class `MLAttributeNotAllowedException`
   
   
   ### Why are the changes needed?
   this error message should contain the class name
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, error change
   
   
   ### How was this patch tested?
   updated tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no








Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


pan3793 commented on code in PR #49854:
URL: https://github.com/apache/spark/pull/49854#discussion_r1948320823


##
pom.xml:
##
@@ -599,7 +599,7 @@
   
         <groupId>org.glassfish.jaxb</groupId>
         <artifactId>jaxb-runtime</artifactId>
-        <version>2.3.2</version>
+        <version>4.0.5</version>

Review Comment:
   I mean `com.sun.xml.fastinfoset:FastInfoset` and `org.jvnet.staxex:stax-ex`, 
they are marked as optional in `org.glassfish.jaxb:jaxb-runtime:4.0.5`



##
pom.xml:
##
@@ -615,16 +615,17 @@
 org.jvnet.staxex
 stax-ex
   
-  
   
-jakarta.activation

Review Comment:
   I mean `com.sun.xml.fastinfoset:FastInfoset` and `org.jvnet.staxex:stax-ex`, 
they are marked as optional in `org.glassfish.jaxb:jaxb-runtime:4.0.5`






Re: [PR] [SPARK-50917][EXAMPLES] Add SparkConnectPi Scala example to work both for Connect and Classic [spark]

2025-02-09 Thread via GitHub


yaooqinn commented on PR #49617:
URL: https://github.com/apache/spark/pull/49617#issuecomment-2646776731

   Thank you @cloud-fan and @dongjoon-hyun, SparkDataFramePi sounds good to me.





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


pan3793 commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646775721

   Should the change be mentioned in the migration guide? Otherwise LGTM








Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1948322440


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##
@@ -265,6 +267,19 @@ object Literal {
   s"Literal must have a corresponding value to ${dataType.catalogString}, 
" +
   s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+CatalystSqlParser.parseExpression(sql).transformUp {

Review Comment:
   Looking at the test failure, the missing piece is that `Literal#sql` can 
generate CAST which needs to be resolved to attach timezone. So we should add 
one more case match
   ```
   case c: Cast if c.needsTimeZone =>
     c.withTimeZone(SQLConf.get.sessionLocalTimeZone)
   ```
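   
   Pieced together with the quoted diff, the suggestion would slot in roughly 
like this (a sketch only; the other match cases from the patch are elided 
because the quoted hunk is truncated):
   
   ```
   def fromSQL(sql: String): Expression = {
     CatalystSqlParser.parseExpression(sql).transformUp {
       // ... other cases from the original patch, truncated in the quoted hunk ...
       case c: Cast if c.needsTimeZone =>
         c.withTimeZone(SQLConf.get.sessionLocalTimeZone)
     }
   }
   ```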






Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948357356


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala:
##
@@ -17,26 +17,29 @@
 
 package org.apache.spark.sql.execution.command.v2
 
+import java.util.Locale
+
 import org.apache.spark.SparkException
-import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.catalog.TempVariableManager
+import org.apache.spark.sql.catalyst.{InternalRow, SqlScriptingLocalVariableManager}
+import org.apache.spark.sql.catalyst.analysis.FakeLocalCatalog
 import org.apache.spark.sql.catalyst.expressions.{Attribute, Literal, VariableReference}
 import org.apache.spark.sql.catalyst.trees.UnaryLike
+import org.apache.spark.sql.errors.QueryCompilationErrors.unresolvedVariableError
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.datasources.v2.V2CommandExec
 
 /**
  * Physical plan node for setting a variable.
  */
-case class SetVariableExec(variables: Seq[VariableReference], query: SparkPlan)
-  extends V2CommandExec with UnaryLike[SparkPlan] {
+case class SetVariableExec(
+    variables: Seq[VariableReference],
+    query: SparkPlan) extends V2CommandExec with UnaryLike[SparkPlan] {

Review Comment:
   unnecessary code style change






Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948357547


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala:
##
@@ -47,21 +50,51 @@ case class SetVariableExec(variables: Seq[VariableReference], query: SparkPlan)
       val row = values(0)
       variables.zipWithIndex.foreach { case (v, index) =>
         val value = row.get(index, v.dataType)
-        createVariable(variableManager, v, value)
+        setVariable(v, value)
       }
     }
     Seq.empty
   }
 
-  private def createVariable(
-      variableManager: TempVariableManager,
+  private def setVariable(
       variable: VariableReference,
       value: Any): Unit = {
-    variableManager.create(
-      variable.identifier.name,
+    val namePartsCaseAdjusted = if (session.sessionState.conf.caseSensitiveAnalysis) {
+      variable.originalNameParts
+    } else {
+      variable.originalNameParts.map(_.toLowerCase(Locale.ROOT))
+    }
+
+    val tempVariableManager = session.sessionState.catalogManager.tempVariableManager
+    val scriptingVariableManager = SqlScriptingLocalVariableManager.get()
+
+    val variableManager = scriptingVariableManager
+      .filter(_ => variable.catalog == FakeLocalCatalog)
+      // If a local variable with nameParts exists, set it using scriptingVariableManager.
+//  .filter(_.get(namePartsCaseAdjusted).isDefined)

Review Comment:
   why is this commented out?






[PR] [SPARK-51142][ML][CONNECT] ML protobufs clean up [spark]

2025-02-09 Thread via GitHub


zhengruifeng opened a new pull request, #49862:
URL: https://github.com/apache/spark/pull/49862

   
   
   ### What changes were proposed in this pull request?
   ML protobufs clean up
   
   
   ### Why are the changes needed?
   to follow the guide 
https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/connect/docs/adding-proto-messages.md
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   existing tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


dusantism-db commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948366905


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala:
##
@@ -47,21 +50,51 @@ case class SetVariableExec(variables: Seq[VariableReference], query: SparkPlan)
       val row = values(0)
       variables.zipWithIndex.foreach { case (v, index) =>
         val value = row.get(index, v.dataType)
-        createVariable(variableManager, v, value)
+        setVariable(v, value)
      }
     }
     Seq.empty
   }
 
-  private def createVariable(
-      variableManager: TempVariableManager,
+  private def setVariable(
       variable: VariableReference,
       value: Any): Unit = {
-    variableManager.create(
-      variable.identifier.name,
+    val namePartsCaseAdjusted = if (session.sessionState.conf.caseSensitiveAnalysis) {
+      variable.originalNameParts
+    } else {
+      variable.originalNameParts.map(_.toLowerCase(Locale.ROOT))
+    }
+
+    val tempVariableManager = session.sessionState.catalogManager.tempVariableManager
+    val scriptingVariableManager = SqlScriptingLocalVariableManager.get()
+
+    val variableManager = scriptingVariableManager
+      .filter(_ => variable.catalog == FakeLocalCatalog)
+      // If a local variable with nameParts exists, set it using scriptingVariableManager.
+//  .filter(_.get(namePartsCaseAdjusted).isDefined)

Review Comment:
   Old code, deleted.






Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


dusantism-db commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948367347


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala:
##
@@ -17,26 +17,29 @@
 
 package org.apache.spark.sql.execution.command.v2
 
+import java.util.Locale
+
 import org.apache.spark.SparkException
-import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.catalog.TempVariableManager
+import org.apache.spark.sql.catalyst.{InternalRow, SqlScriptingLocalVariableManager}
+import org.apache.spark.sql.catalyst.analysis.FakeLocalCatalog
 import org.apache.spark.sql.catalyst.expressions.{Attribute, Literal, VariableReference}
 import org.apache.spark.sql.catalyst.trees.UnaryLike
+import org.apache.spark.sql.errors.QueryCompilationErrors.unresolvedVariableError
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.datasources.v2.V2CommandExec
 
 /**
  * Physical plan node for setting a variable.
  */
-case class SetVariableExec(variables: Seq[VariableReference], query: SparkPlan)
-  extends V2CommandExec with UnaryLike[SparkPlan] {
+case class SetVariableExec(
+    variables: Seq[VariableReference],
+    query: SparkPlan) extends V2CommandExec with UnaryLike[SparkPlan] {

Review Comment:
   fixed



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -266,22 +275,42 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
       }
     }
 
-    if (maybeTempVariableName(nameParts)) {
-      val variableName = if (conf.caseSensitiveAnalysis) {
-        nameParts.last
-      } else {
-        nameParts.last.toLowerCase(Locale.ROOT)
-      }
-      catalogManager.tempVariableManager.get(variableName).map { varDef =>
+    val namePartsCaseAdjusted = if (conf.caseSensitiveAnalysis) {
+      nameParts
+    } else {
+      nameParts.map(_.toLowerCase(Locale.ROOT))
+    }
+
+    SqlScriptingLocalVariableManager.get()
+      // If we are in EXECUTE IMMEDIATE lookup only session variables.
+      .filterNot(_ => AnalysisContext.get.isExecuteImmediate)
+      // If variable name is qualified with system.session. or session.

Review Comment:
   applied









Re: [PR] [SPARK-51142][ML][CONNECT] ML protobufs clean up [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on code in PR #49862:
URL: https://github.com/apache/spark/pull/49862#discussion_r1948375370


##
sql/connect/common/src/main/protobuf/spark/connect/ml_common.proto:
##
@@ -33,24 +33,32 @@ message MlParams {
 
 // MLOperator represents the ML operators like (Estimator, Transformer or 
Evaluator)
 message MlOperator {
-  // The qualified name of the ML operator.
+  // (Required) The qualified name of the ML operator.
   string name = 1;
-  // Unique id of the ML operator
+
+  // (Required) Unique id of the ML operator
   string uid = 2;
-  // Represents what the ML operator is
+
+  // (Required) Represents what the ML operator is
   OperatorType type = 3;
+
   enum OperatorType {
-UNSPECIFIED = 0;
-ESTIMATOR = 1;
-TRANSFORMER = 2;
-EVALUATOR = 3;
-MODEL = 4;
+OPERATOR_TYPE_UNSPECIFIED = 0;

Review Comment:
   rename to follow the protobuf style guide
   
   > Prefix every value with the enum name (converted to UPPER_SNAKE_CASE)
   
   https://protobuf.dev/programming-guides/style/#enums






Re: [PR] [SPARK-51142][ML][CONNECT] ML protobufs clean up [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49862:
URL: https://github.com/apache/spark/pull/49862#issuecomment-2646882798

   cc @grundprinzip and @wbo4958 





Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948356321


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala:
##
@@ -73,28 +95,49 @@ class ResolveCatalogs(val catalogManager: CatalogManager)
     }
   }
 
-  private def resolveVariableName(nameParts: Seq[String]): ResolvedIdentifier = {
-    def ident: Identifier = Identifier.of(Array(CatalogManager.SESSION_NAMESPACE), nameParts.last)
-    if (nameParts.length == 1) {
-      ResolvedIdentifier(FakeSystemCatalog, ident)
-    } else if (nameParts.length == 2) {
-      if (nameParts.head.equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
-        ResolvedIdentifier(FakeSystemCatalog, ident)
-      } else {
-        throw QueryCompilationErrors.unresolvedVariableError(
-          nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
-      }
-    } else if (nameParts.length == 3) {
-      if (nameParts(0).equalsIgnoreCase(CatalogManager.SYSTEM_CATALOG_NAME) &&
-        nameParts(1).equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
-        ResolvedIdentifier(FakeSystemCatalog, ident)
-      } else {
-        throw QueryCompilationErrors.unresolvedVariableError(
-          nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
-      }
-    } else {
+  private def resolveCreateVariableName(nameParts: Seq[String]): ResolvedIdentifier = {
+    val resolvedIdentifier = SqlScriptingLocalVariableManager.get()
+      .filterNot(_ => AnalysisContext.get.isExecuteImmediate)
+      .getOrElse(catalogManager.tempVariableManager)
+      .resolveIdentifier(nameParts.last)

Review Comment:
   what's wrong  with my proposal 
https://github.com/apache/spark/pull/49445/files#r1946867299 ?









Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948356233


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala:
##
@@ -73,28 +95,49 @@ class ResolveCatalogs(val catalogManager: CatalogManager)
     }
   }
 
-  private def resolveVariableName(nameParts: Seq[String]): ResolvedIdentifier = {
-    def ident: Identifier = Identifier.of(Array(CatalogManager.SESSION_NAMESPACE), nameParts.last)
-    if (nameParts.length == 1) {
-      ResolvedIdentifier(FakeSystemCatalog, ident)
-    } else if (nameParts.length == 2) {
-      if (nameParts.head.equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
-        ResolvedIdentifier(FakeSystemCatalog, ident)
-      } else {
-        throw QueryCompilationErrors.unresolvedVariableError(
-          nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
-      }
-    } else if (nameParts.length == 3) {
-      if (nameParts(0).equalsIgnoreCase(CatalogManager.SYSTEM_CATALOG_NAME) &&
-        nameParts(1).equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
-        ResolvedIdentifier(FakeSystemCatalog, ident)
-      } else {
-        throw QueryCompilationErrors.unresolvedVariableError(
-          nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
-      }
-    } else {
+  private def resolveCreateVariableName(nameParts: Seq[String]): ResolvedIdentifier = {
+    val resolvedIdentifier = SqlScriptingLocalVariableManager.get()
+      .filterNot(_ => AnalysisContext.get.isExecuteImmediate)
+      .getOrElse(catalogManager.tempVariableManager)
+      .resolveIdentifier(nameParts.last)

Review Comment:
   we should either use `assert` to call out all the assumptions or make the 
code general.






Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948357083


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/VariableManager.scala:
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.connector.catalog.{CatalogManager, Identifier}
+import org.apache.spark.sql.connector.catalog.CatalogManager.{SESSION_NAMESPACE, SYSTEM_CATALOG_NAME}
+import org.apache.spark.sql.errors.DataTypeErrorsBase
+import org.apache.spark.sql.errors.QueryCompilationErrors.unresolvedVariableError
+
+/**
+ * Trait which provides an interface for variable managers. Methods are case sensitive regarding
+ * the variable name/nameParts/identifier, callers are responsible to
+ * format them w.r.t. case-sensitive config.
+ */
+trait VariableManager {
+  /**
+   * Create a variable.
+   * @param identifier Identifier of the variable.
+   * @param defaultValueSQL SQL text of the variable's DEFAULT expression.
+   * @param initValue Initial value of the variable.
+   * @param overrideIfExists If true, the new variable will replace an existing one
+   * with the same identifier, if it exists.
+   */
+  def create(
+      identifier: Identifier,
+      defaultValueSQL: String,
+      initValue: Literal,
+      overrideIfExists: Boolean): Unit
+
+  /**
+   * Set an existing variable to a new value.
+   *
+   * @param nameParts Name parts of the variable.
+   * @param defaultValueSQL SQL text of the variable's DEFAULT expression.
+   * @param initValue Initial value of the variable.
+   * @param identifier Identifier of the variable.
+   */
+  def set(
+      nameParts: Seq[String],

Review Comment:
   why do we keep both Identifier and Seq[String]? what's the point of this 
duplication?






Re: [PR] [SPARK-51133][BUILD] Upgrade Apache `commons-pool2` to 2.12.1 [spark]

2025-02-09 Thread via GitHub


LuciferYang commented on PR #49856:
URL: https://github.com/apache/spark/pull/49856#issuecomment-2646847348

   Merged into master. Thanks @wayneguow 





Re: [PR] [SPARK-48530][SQL] Support for local variables in SQL Scripting [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49445:
URL: https://github.com/apache/spark/pull/49445#discussion_r1948354168


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -266,22 +275,42 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
       }
     }
 
-    if (maybeTempVariableName(nameParts)) {
-      val variableName = if (conf.caseSensitiveAnalysis) {
-        nameParts.last
-      } else {
-        nameParts.last.toLowerCase(Locale.ROOT)
-      }
-      catalogManager.tempVariableManager.get(variableName).map { varDef =>
+    val namePartsCaseAdjusted = if (conf.caseSensitiveAnalysis) {
+      nameParts
+    } else {
+      nameParts.map(_.toLowerCase(Locale.ROOT))
+    }
+
+    SqlScriptingLocalVariableManager.get()
+      // If we are in EXECUTE IMMEDIATE lookup only session variables.
+      .filterNot(_ => AnalysisContext.get.isExecuteImmediate)
+      // If variable name is qualified with system.session. or session.

Review Comment:
   ```suggestion
      // If variable name is qualified with session.
   ```






Re: [PR] [SPARK-51133][BUILD] Upgrade Apache `commons-pool2` to 2.12.1 [spark]

2025-02-09 Thread via GitHub


LuciferYang closed pull request #49856: [SPARK-51133][BUILD] Upgrade Apache 
`commons-pool2` to 2.12.1
URL: https://github.com/apache/spark/pull/49856





Re: [PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49859:
URL: https://github.com/apache/spark/pull/49859#issuecomment-2646858360

   This failure also happens in 
https://github.com/apache/spark/actions/workflows/build_python_3.11_macos.yml
   
   After checking the history, I think it is due to the `plotly` upgrade.
   
   @LuciferYang and @HyukjinKwon 





[PR] [SPARK-51143][PYTHON] Pin `plotly==5.24.1` [spark]

2025-02-09 Thread via GitHub


zhengruifeng opened a new pull request, #49863:
URL: https://github.com/apache/spark/pull/49863

   
   
   ### What changes were proposed in this pull request?
   Pin `plotly==5.24.1`
   
   
   ### Why are the changes needed?
   the latest plotly 6.0 causes many plot-related test failures
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   manually checked with
   ```
   python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.tests.connect.test_parity_frame_plot_plotly FramePlotPlotlyParityTests.test_pie_plot'
   ```
   
   before:
   ```
   (spark_312) ➜  spark git:(pin_plotly) python/run-tests -k 
--python-executables python3 --testnames 
'pyspark.sql.tests.connect.test_parity_frame_plot_plotly 
FramePlotPlotlyParityTests.test_pie_plot'
   Running PySpark tests. Output is in 
/Users/ruifeng.zheng/Dev/spark/python/unit-tests.log
   Will test against the following Python executables: ['python3']
   Will test the following Python tests: 
['pyspark.sql.tests.connect.test_parity_frame_plot_plotly 
FramePlotPlotlyParityTests.test_pie_plot']
   python3 python_implementation is CPython
   python3 version is: Python 3.12.9
   Starting test(python3): 
pyspark.sql.tests.connect.test_parity_frame_plot_plotly 
FramePlotPlotlyParityTests.test_pie_plot (temp output: 
/Users/ruifeng.zheng/Dev/spark/python/target/4d10075d-bb7b-4d4b-b17d-edbef2f7/python3__pyspark.sql.tests.connect.test_parity_frame_plot_plotly_FramePlotPlotlyParityTests.test_pie_plot__6qxzu16x.log)
   
   Running tests...
   --
   WARNING: Using incubator modules: jdk.incubator.vector
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/conf.py:64: 
UserWarning: Failed to set 
spark.connect.execute.reattachable.senderMaxStreamDuration to Some(1s) due to 
[CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: 
"spark.connect.execute.reattachable.senderMaxStreamDuration".
   See also 
'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'. 
SQLSTATE: 46110
 warnings.warn(warn)
   /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/conf.py:64: 
UserWarning: Failed to set 
spark.connect.execute.reattachable.senderMaxStreamSize to Some(123) due to 
[CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: 
"spark.connect.execute.reattachable.senderMaxStreamSize".
   See also 
'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'. 
SQLSTATE: 46110
 warnings.warn(warn)
 test_pie_plot 
(pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_pie_plot)
 ... FAIL (1.760s)
   
   ==
   FAIL [1.760s]: test_pie_plot 
(pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_pie_plot)
   --
   Traceback (most recent call last):
 File 
"/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/tests/plot/test_frame_plot_plotly.py",
 line 318, in test_pie_plot
   self._check_fig_data(fig["data"][0], **expected_fig_data_sales)
 File 
"/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/tests/plot/test_frame_plot_plotly.py",
 line 81, in _check_fig_data
   self.assertEqual(converted_values, expected_value)
   AssertionError: Lists differ: [15173568000, 1519776,[37 
chars]] != [datetime.datetime(2018, 1, 31, 0, 0), dat[105 chars], 0)]
   
   First differing element 0:
   datetime.datetime(2018, 1, 31, 0, 0)
   
   - [15173568000,
   -  1519776,
   -  15224544000,
   -  15250464000]
   + [datetime.datetime(2018, 1, 31, 0, 0),
   +  datetime.datetime(2018, 2, 28, 0, 0),
   +  datetime.datetime(2018, 3, 31, 0, 0),
   +  datetime.datetime(2018, 4, 30, 0, 0)]
   
   --
   Ran 1 test in 5.573s
   
   FAILED (failures=1)
   
   Generating XML reports...
   Generated XML report: 
target/test-reports/TEST-pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests-20250210120410.xml
   
   Had test failures in pyspark.sql.tests.connect.test_parity_frame_plot_plotly 
FramePlotPlotlyParityTests.test_pie_plot with python3; see logs.
   
   ```
   
   after:
   ```
   (spark_312) ➜  spark git:(pin_plotly) python/run-tests -k 
--python-executables python3 --testnames 
'pyspark.sql.tests.connect.test_parity_frame_plot_plotly 
FramePlotPlotlyParityTests.test_pie_plot'
   Running PySpark tests. Output is in 
/Users/ruifeng.zheng/Dev/spark/python/unit-tests.log
   Will test against the following Python executables: ['python3']
   Will test the followi

Re: [PR] [SPARK-51143][PYTHON] Pin `plotly==5.24.1` [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49863:
URL: https://github.com/apache/spark/pull/49863#issuecomment-2646873479

   also cc @cloud-fan, we probably need to pin `plotly` in the release docker 
image.





Re: [PR] [SPARK-51139][ML][CONNECT] Refine error class `MLAttributeNotAllowedException` [spark]

2025-02-09 Thread via GitHub


zhengruifeng closed pull request #49860: [SPARK-51139][ML][CONNECT] Refine 
error class `MLAttributeNotAllowedException`
URL: https://github.com/apache/spark/pull/49860





Re: [PR] [SPARK-51139][ML][CONNECT] Refine error class `MLAttributeNotAllowedException` [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49860:
URL: https://github.com/apache/spark/pull/49860#issuecomment-2646875702

   thanks, merged to master/4.0





Re: [PR] [SPARK-51143][PYTHON] Pin `plotly==5.24.1` [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49863:
URL: https://github.com/apache/spark/pull/49863#issuecomment-2647094992

   The update in the docker file triggered a refresh of the cache, so `torch` 
was also upgraded, which caused:
   
   ```
   ==
   ERROR [42.116s]: test_save_load 
(pyspark.ml.tests.connect.test_connect_classification.ClassificationTestsOnConnect.test_save_load)
   --
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/ml/tests/connect/test_legacy_mode_classification.py",
 line 185, in test_save_load
   lor_torch_model = torch.load(
 ^^^
 File "/usr/local/lib/python3.11/dist-packages/torch/serialization.py", 
line 1470, in load
   raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
   _pickle.UnpicklingError: Weights only load failed. This file can still be 
loaded, to do so you have two options, do those steps only if you trust the 
source of the checkpoint. 
(1) In PyTorch 2.6, we changed the default value of the `weights_only` 
argument in `torch.load` from `False` to `True`. Re-running `torch.load` with 
`weights_only` set to `False` will likely succeed, but it can result in 
arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the 
recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL 
torch.nn.modules.container.Sequential was not an allowed global by default. 
Please use `torch.serialization.add_safe_globals([Sequential])` or the 
`torch.serialization.safe_globals([Sequential])` context manager to allowlist 
this global if you trust this class/function.
   
   Check the documentation of torch.load to learn more about types accepted by 
default with weights_only 
https://pytorch.org/docs/stable/generated/torch.load.html.
   ```





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


vruusmann commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646143491

   A quick memory dump on JPMML-Model evolution:
   
   - `1.5.X`. Migrating from PMML schema 4.3 to 4.4. JDK 8 compatible.
   - `1.6.X`. API upgrades. Specifically, replacing `org.dmg.pmml.FieldName` 
(pseudo-enum for representing field names) with plain `java.lang.String`. Also, 
migrating from `javax.xml.bind.*` API to `jakarta.xml.bind.*` API. JDK 8 
compatible.
   - `1.7.X`. Upgrading from JDK 8 to JDK 11. The codebase is effectively 
unchanged. Just picking up up-to-date dependencies.
   
   The `org.glassfish.jaxb:jaxb-*` dependencies are chosen based on their JDK 
compatibility. For example, `org.glassfish.jaxb:jaxb-runtime:3.X` versions are 
targeting JDK 8, whereas `org.glassfish.jaxb:jaxb-runtime:4.X` are targeting 
JDK 11. Apache Spark 4.X is targeting JDK 17, no?
   
   As for encoding PMML schema versions and XML namespaces, there's an 
`org.dmg.pmml.Version` enum class for this purpose. You could define a class 
constant `SPARK_PMML_VERSION` somewhere, and then access its `#version` and 
`#namespaceURI` string attributes:
   
   ```java
   org.dmg.pmml.Version SPARK_PMML_VERSION = org.dmg.pmml.Version.PMML_4_4;
   
   System.out.println(SPARK_PMML_VERSION.getVersion()); // Prints "4.4"
   System.out.println(SPARK_PMML_VERSION.getNamespaceURI()); // Prints "http://www.dmg.org/PMML-4_4"
   ```





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646149933

   > Apache Spark 4.X is targeting JDK 17, no?
   
   Yes, Spark 4.x is targeting JDK 17.
   
   
https://github.com/apache/spark/blob/301b666a1fcbd4c59d96c53fe3a547ea1512f397/pom.xml#L117





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


vruusmann commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646164469

   > Yes, Spark 4.x is targeting JDK 17.
   
   Perhaps it is then possible to be even more aggressive on some transitive 
dependency updates (than JPMML-Model 1.7.1 proposes).
   
   My goal is to maintain source compatibility with legacy platforms. One of 
the new goals for the `1.7.X` development branch is to achieve 
cross-compilation to JavaScript/WASM, as implemented by the TeaVM platform.
   
   TLDR: The JPMML-Model dependency upgrade is highly unlikely to break 
anything. It's targeting JDK 11, whereas you are already targeting JDK 17. My 
unit and integration tests indicate that JPMML-Model runs fine on all JDK 11+ 
versions (specifically, 11, 17 and 21 LTS versions).





Re: [PR] [SQL][SPARK-51113] Fix correctness with UNION/EXCEPT/INTERSECT inside a view or EXECUTE IMMEDIATE [spark]

2025-02-09 Thread via GitHub


srielau commented on PR #49835:
URL: https://github.com/apache/spark/pull/49835#issuecomment-2646656062

   > Here's the commit that enabled `SELECT 1 UNION SELECT 2` syntax in SQL 
parser: #40835. This commit enabled view creation with such SQL text 
(`parsePlan` is used for `CREATE VIEW`).
   > 
   > However, even before this commit `parseQuery` parsed `UNION` in `SELECT 1 
UNION SELECT 2` as an alias, it's just we could not create such a view.
   
   I still cannot believe that this is limited to UNION.





Re: [PR] [SQL][SPARK-51113] Fix correctness with UNION/EXCEPT/INTERSECT inside a view or EXECUTE IMMEDIATE [spark]

2025-02-09 Thread via GitHub


srielau commented on PR #49835:
URL: https://github.com/apache/spark/pull/49835#issuecomment-2646655795

   > `spark.conf.set("spark.sql.ansi.enforceReservedKeywords", "true")` 
mitigates the problem, since UNION becomes a reserved keyword.
   
   True, a rather hefty mitigation, though. Lots of collateral damage.





Re: [PR] [SPARK-51136][HISTORYSERVER] Set CallerContext for History Server [spark]

2025-02-09 Thread via GitHub


cnauroth commented on PR #49858:
URL: https://github.com/apache/spark/pull/49858#issuecomment-2646610534

   If approved, can this also go into branch-3.5 please? The cherry-pick would 
need a minor merge conflict resolution in `FsHistoryProvider` import 
statements, or I can send a separate pull request.





Re: [PR] [SPARK-51111][DSTREAM] Avoid consumer rebalancing stuck when starting a spark streaming job [spark]

2025-02-09 Thread via GitHub


yabola commented on PR #49831:
URL: https://github.com/apache/spark/pull/49831#issuecomment-2646800578

   Hi @tdas, could you help review this? Thanks!





Re: [PR] [SPARK-51008][SQL] Add ResultStage for AQE [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49715:
URL: https://github.com/apache/spark/pull/49715#discussion_r1948335832


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -344,56 +346,53 @@ case class AdaptiveSparkPlanExec(
 if (errors.nonEmpty) {
   cleanUpAndThrowException(errors.toSeq, None)
 }
-
-// Try re-optimizing and re-planning. Adopt the new plan if its cost 
is equal to or less
-// than that of the current plan; otherwise keep the current physical 
plan together with
-// the current logical plan since the physical plan's logical links 
point to the logical
-// plan it has originated from.
-// Meanwhile, we keep a list of the query stages that have been 
created since last plan
-// update, which stands for the "semantic gap" between the current 
logical and physical
-// plans. And each time before re-planning, we replace the 
corresponding nodes in the
-// current logical plan with logical query stages to make it 
semantically in sync with
-// the current physical plan. Once a new plan is adopted and both 
logical and physical
-// plans are updated, we can clear the query stage list because at 
this point the two plans
-// are semantically and physically in sync again.
-val logicalPlan = 
replaceWithQueryStagesInLogicalPlan(currentLogicalPlan, stagesToReplace)
-val afterReOptimize = reOptimize(logicalPlan)
-if (afterReOptimize.isDefined) {
-  val (newPhysicalPlan, newLogicalPlan) = afterReOptimize.get
-  val origCost = costEvaluator.evaluateCost(currentPhysicalPlan)
-  val newCost = costEvaluator.evaluateCost(newPhysicalPlan)
-  if (newCost < origCost ||
-(newCost == origCost && currentPhysicalPlan != newPhysicalPlan)) {
-lazy val plans =
-  sideBySide(currentPhysicalPlan.treeString, 
newPhysicalPlan.treeString).mkString("\n")
-logOnLevel(log"Plan changed:\n${MDC(QUERY_PLAN, plans)}")
-cleanUpTempTags(newPhysicalPlan)
-currentPhysicalPlan = newPhysicalPlan
-currentLogicalPlan = newLogicalPlan
-stagesToReplace = Seq.empty[QueryStageExec]
+if (!currentPhysicalPlan.isInstanceOf[ResultQueryStageExec]) {
+  // Try re-optimizing and re-planning. Adopt the new plan if its cost 
is equal to or less
+  // than that of the current plan; otherwise keep the current 
physical plan together with
+  // the current logical plan since the physical plan's logical links 
point to the logical
+  // plan it has originated from.
+  // Meanwhile, we keep a list of the query stages that have been 
created since last plan
+  // update, which stands for the "semantic gap" between the current 
logical and physical
+  // plans. And each time before re-planning, we replace the 
corresponding nodes in the
+  // current logical plan with logical query stages to make it 
semantically in sync with
+  // the current physical plan. Once a new plan is adopted and both 
logical and physical
+  // plans are updated, we can clear the query stage list because at 
this point the two
+  // plans are semantically and physically in sync again.
+  val logicalPlan = 
replaceWithQueryStagesInLogicalPlan(currentLogicalPlan, stagesToReplace)
+  val afterReOptimize = reOptimize(logicalPlan)
+  if (afterReOptimize.isDefined) {
+val (newPhysicalPlan, newLogicalPlan) = afterReOptimize.get
+val origCost = costEvaluator.evaluateCost(currentPhysicalPlan)
+val newCost = costEvaluator.evaluateCost(newPhysicalPlan)
+if (newCost < origCost ||
+  (newCost == origCost && currentPhysicalPlan != newPhysicalPlan)) 
{
+  lazy val plans = sideBySide(
+currentPhysicalPlan.treeString, 
newPhysicalPlan.treeString).mkString("\n")
+  logOnLevel(log"Plan changed:\n${MDC(QUERY_PLAN, plans)}")
+  cleanUpTempTags(newPhysicalPlan)
+  currentPhysicalPlan = newPhysicalPlan
+  currentLogicalPlan = newLogicalPlan
+  stagesToReplace = Seq.empty[QueryStageExec]
+}
   }
 }
 // Now that some stages have finished, we can try creating new stages.
-result = createQueryStages(currentPhysicalPlan)
+result = createQueryStages(fun, currentPhysicalPlan, firstRun = false)
   }
-
-  // Run the final plan when there's no more unfinished stages.
-  currentPhysicalPlan = applyPhysicalRules(
-optimizeQueryStage(result.newPlan, isFinalStage = true),
-postStageCreationRules(supportsColumnar),
-Some((planChangeLogger, "AQE Post Stage Creation")))
-  _isFinalPlan = true
-  executionId.foreach(onUpdatePlan(_, Seq(curren

Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] Update the release docker image to follow what we use in Github Action jobs [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on PR #49851:
URL: https://github.com/apache/spark/pull/49851#issuecomment-2646810575

   The streaming test failure is unrelated, thanks for the review, merging to 
master/4.0!





Re: [PR] [SPARK-48239][INFRA][FOLLOWUP] Update the release docker image to follow what we use in Github Action jobs [spark]

2025-02-09 Thread via GitHub


cloud-fan closed pull request #49851: [SPARK-48239][INFRA][FOLLOWUP] Update the 
release docker image to follow what we use in Github Action jobs
URL: https://github.com/apache/spark/pull/49851





Re: [PR] [SPARK-51109][SQL] CTE in subquery expression as grouping column [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on PR #49829:
URL: https://github.com/apache/spark/pull/49829#issuecomment-2646813304

   thanks for the review! merging to master/4.0!





Re: [PR] [SPARK-50953][PYTHON][CONNECT] Add support for non-literal paths in VariantGet [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on PR #49609:
URL: https://github.com/apache/spark/pull/49609#issuecomment-2646812807

   The code doesn't compile...





Re: [PR] [SPARK-51109][SQL] CTE in subquery expression as grouping column [spark]

2025-02-09 Thread via GitHub


cloud-fan closed pull request #49829: [SPARK-51109][SQL] CTE in subquery 
expression as grouping column
URL: https://github.com/apache/spark/pull/49829





Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
   }
 case _ =>
   if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+  ExpressionStats(expr)(0, useCount)

Review Comment:
   ```suggestion
 ExpressionStats(expr)(useCount = 0, conditionalUseCount = useCount)
   ```






Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
   }
 case _ =>
   if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+  ExpressionStats(expr)(0, useCount)

Review Comment:
   ```suggestion
 ExpressionStats(expr)(useCount = 0, useCount)
   ```
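   
   A minimal, self-contained illustration of why the named-argument form reads 
better here (hypothetical `Stats` class standing in for `ExpressionStats`, 
which carries two adjacent `Int` parameters after this change):
   
   ```scala
   // Hypothetical stand-in: two Int parameters in a row are easy to transpose.
   case class Stats(expr: String)(val useCount: Int, val conditionalUseCount: Int)
   
   val positional = Stats("a + b")(0, 3)                             // which count is which?
   val named = Stats("a + b")(useCount = 0, conditionalUseCount = 3) // self-documenting
   ```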






Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948344607


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -99,34 +111,37 @@ class EquivalentExpressions(
* only the common nodes.
* Those common nodes are then removed from the local map and added to the 
final map of
* expressions.
+   *
+   * Conditional expressions are not considered because we are simply looking 
for expressions
+   * evaluated once in each parent expression.
*/
-  private def updateCommonExprs(
-  exprs: Seq[Expression],
-  map: mutable.HashMap[ExpressionEquals, ExpressionStats],

Review Comment:
   Shall we update the doc of this method? There is no `equivalenceMap` in this method now.






Re: [PR] [SPARK-50953][PYTHON][CONNECT] Add support for non-literal paths in VariantGet [spark]

2025-02-09 Thread via GitHub


harshmotw-db commented on PR #49609:
URL: https://github.com/apache/spark/pull/49609#issuecomment-2646925700

   @cloud-fan Thanks, I think we're good now





Re: [PR] [SQL][SPARK-51113] Fix correctness with UNION/EXCEPT/INTERSECT inside a view or EXECUTE IMMEDIATE [spark]

2025-02-09 Thread via GitHub


vladimirg-db commented on PR #49835:
URL: https://github.com/apache/spark/pull/49835#issuecomment-2646625494

   Here's the commit that enabled `SELECT 1 UNION SELECT 2` syntax in SQL 
parser: https://github.com/apache/spark/pull/40835. This commit enabled view 
creation with such SQL text (`parsePlan` is used for `CREATE VIEW`).
   
   However, even before this commit `parseQuery` parsed `UNION` in `SELECT 1 
UNION SELECT 2` as an alias; it's just that we could not create such a view.
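   
   A minimal sketch of the ambiguity, assuming default keyword handling where 
`UNION` is not reserved:
   
   ```scala
   // UNION can be taken as a column alias rather than a set operator:
   spark.sql("SELECT 1 UNION").printSchema()  // a single column named `UNION`
   
   // With reserved keywords enforced, the same text becomes a parse error:
   spark.conf.set("spark.sql.ansi.enforceReservedKeywords", "true")
   // spark.sql("SELECT 1 UNION")             // now fails: UNION is reserved
   ```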





[PR] [SPARK-51140][ML] Sort the params before saving [spark]

2025-02-09 Thread via GitHub


zhengruifeng opened a new pull request, #49861:
URL: https://github.com/apache/spark/pull/49861

   
   
   ### What changes were proposed in this pull request?
   Sort the params before saving
   
   ### Why are the changes needed?
   To improve debuggability: when developing ML Connect, I sometimes need to 
manually check the stored models, and I noticed that the params are saved 
unsorted, which makes it hard to compare them:
   
   before:
   
   ```
   
{"class":"org.apache.spark.ml.clustering.PowerIterationClustering","timestamp":1738926090947,"sparkVersion":"4.1.0-SNAPSHOT","uid":"PowerIterationClustering_a5c66eecaec6","paramMap":{"dstCol":"dst","k":2,"weightCol":"weight","maxIter":40,"srcCol":"src","initMode":"random"},"defaultParamMap":{"dstCol":"dst","k":2,"maxIter":20,"srcCol":"src","initMode":"random"}}
   ```
   
   ```
   
{"class":"org.apache.spark.ml.clustering.PowerIterationClustering","timestamp":1738926386839,"sparkVersion":"4.1.0-SNAPSHOT","uid":"PowerIterationClustering_b91c5734d913","paramMap":{"k":2,"initMode":"random","weightCol":"weight","srcCol":"src","maxIter":40,"dstCol":"dst"},"defaultParamMap":{"k":2,"initMode":"random","srcCol":"src","maxIter":20,"dstCol":"dst"}}
   ```
   
   after:
   ```
   
{"class":"org.apache.spark.ml.clustering.PowerIterationClustering","timestamp":1738926422941,"sparkVersion":"4.1.0-SNAPSHOT","uid":"PowerIterationClustering_96355d38f450","paramMap":{"initMode":"random","maxIter":20,"srcCol":"src","dstCol":"dst","k":2},"defaultParamMap":{"initMode":"random","maxIter":20,"srcCol":"src","dstCol":"dst","k":2}}
   ```
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   existing tests and manually check
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
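   
   A minimal sketch of the idea (illustrative code, not the actual ML writer 
implementation): render the param map in a deterministic key order so that two 
saves of equivalent models produce identical metadata:
   
   ```scala
   // Illustrative only: sort (name, rendered JSON value) pairs by name
   // before building the JSON object, so field order is stable across saves.
   def renderParamMap(params: Map[String, String]): String =
     params.toSeq
       .sortBy(_._1)
       .map { case (name, json) => s""""$name":$json""" }
       .mkString("{", ",", "}")
   
   renderParamMap(Map("k" -> "2", "initMode" -> "\"random\"", "maxIter" -> "20"))
   // {"initMode":"random","k":2,"maxIter":20}
   ```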





Re: [PR] [SPARK-50917][EXAMPLES] Add Pi Scala example to work both for Connect and Classic [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49617:
URL: https://github.com/apache/spark/pull/49617#discussion_r1948330890


##
examples/src/main/scala/org/apache/spark/examples/sql/SparkDataFramePi.scala:
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.sql
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.functions._
+
+/** Computes an approximation to pi with SparkSession/DataFrame APIs */
+object SparkDataFramePi {
+  def main(args: Array[String]): Unit = {
+val spark = SparkSession
+  .builder()
+  .appName("Spark Connect Pi")

Review Comment:
   ```suggestion
 .appName("Spark DataFrame Pi")
   ```
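   
   The quoted file is truncated at the builder call. A minimal sketch of how a 
DataFrame-based Pi approximation typically looks (Monte Carlo sampling; the 
body below is an assumption for illustration, not the PR's actual code):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._
   
   // Hypothetical object name, mirroring the SparkDataFramePi example above.
   object DataFramePiSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("Spark DataFrame Pi").getOrCreate()
       val n = 1000000L
       // Sample points in the unit square; count those inside the quarter circle.
       val inside = spark.range(n)
         .select(rand().as("x"), rand().as("y"))
         .filter(col("x") * col("x") + col("y") * col("y") <= 1)
         .count()
       println(s"Pi is roughly ${4.0 * inside / n}")
       spark.stop()
     }
   }
   ```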






Re: [PR] [SPARK-51067][SQL] Revert session level collation for DML queries and apply object level collation for DDL queries [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49772:
URL: https://github.com/apache/spark/pull/49772#discussion_r1948333282


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultStringTypes.scala:
##
@@ -18,15 +18,15 @@
 package org.apache.spark.sql.catalyst.analysis
 
 import org.apache.spark.sql.catalyst.expressions.{Cast, Expression, Literal}
-import org.apache.spark.sql.catalyst.plans.logical.{AddColumns, AlterColumns, 
AlterColumnSpec, AlterViewAs, ColumnDefinition, CreateView, LogicalPlan, 
QualifiedColType, ReplaceColumns, V1CreateTablePlan, V2CreateTablePlan}
+import org.apache.spark.sql.catalyst.plans.logical.{AddColumns, AlterColumns, 
AlterColumnSpec, AlterTableCommand, AlterViewAs, ColumnDefinition, CreateTable, 
CreateView, LogicalPlan, QualifiedColType, ReplaceColumns, V1CreateTablePlan, 
V2CreateTablePlan}
 import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
+import org.apache.spark.sql.connector.catalog.TableCatalog
 import org.apache.spark.sql.types.{DataType, StringType}
 
 /**
  * Resolves default string types in queries and commands. For queries, the 
default string type is
- * determined by the session's default string type. For DDL, the default 
string type is the
- * default type of the object (table -> schema -> catalog). However, this is 
not implemented yet.
- * So, we will just use UTF8_BINARY for now.
+ * determined by the default string type, which is UTF8_BINARY. For DDL, the 
default string type

Review Comment:
   This is confusing. For queries, the default string type is handled by the 
parser: when collation is not specified, we use `object StringType`. What does 
this rule need to do?






Re: [PR] [SPARK-51135][SQL] Fix ViewResolverSuite for ANSI modes [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on PR #49857:
URL: https://github.com/apache/spark/pull/49857#issuecomment-2646800076

   thanks, merging to master/4.0!





Re: [PR] [SPARK-51067][SQL] Revert session level collation for DML queries and apply object level collation for DDL queries [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #49772:
URL: https://github.com/apache/spark/pull/49772#discussion_r1948333844


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultStringTypes.scala:
##
@@ -79,18 +79,37 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
 expression.exists(e => transformExpression.isDefinedAt(e))
   }
 
-  private def isDefaultSessionCollationUsed: Boolean = conf.defaultStringType 
== StringType
+  /** Default string type is UTF8_BINARY */
+  private def defaultStringType: StringType = StringType("UTF8_BINARY")
 
-  /**
-   * Returns the default string type that should be used in a given DDL 
command (for now always
-   * UTF8_BINARY).
-   */
-  private def stringTypeForDDLCommand(table: LogicalPlan): StringType =
-StringType("UTF8_BINARY")
+  /** Returns the default string type that should be used in a given DDL 
command */
+  private def stringTypeForDDLCommand(table: LogicalPlan): StringType = {
+if (table.isInstanceOf[CreateTable]) {
+  if (table.asInstanceOf[CreateTable].tableSpec.collation.isDefined) {
+return 
StringType(table.asInstanceOf[CreateTable].tableSpec.collation.get)
+  }
+}
+else if (table.isInstanceOf[CreateView]) {
+  if (table.asInstanceOf[CreateView].collation.isDefined) {
+return StringType(table.asInstanceOf[CreateView].collation.get)
+  }
+}
+else if (table.isInstanceOf[AlterTableCommand]) {
+  if (table.asInstanceOf[AlterTableCommand].table.resolved) {
+val collation = Option(table.asInstanceOf[AlterTableCommand]
+  .table.asInstanceOf[ResolvedTable]
+  .table.properties.get(TableCatalog.PROP_COLLATION))
+if (collation.isDefined) {
+  return StringType(collation.get)
+}
+  }
+}
+defaultStringType
+  }
 
-  /** Returns the session default string type */
-  private def sessionDefaultStringType: StringType =
-StringType(conf.defaultStringType.collationId)
+  /** Returns the session default string type used for DML queries */

Review Comment:
   again, there is no session default string type anymore.






Re: [PR] [SPARK-51135][SQL] Fix ViewResolverSuite for ANSI modes [spark]

2025-02-09 Thread via GitHub


cloud-fan closed pull request #49857: [SPARK-51135][SQL] Fix ViewResolverSuite 
for ANSI modes
URL: https://github.com/apache/spark/pull/49857





Re: [PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


HyukjinKwon closed pull request #49859: [SPARK-51138][PYTHON][CONNECT][TESTS] 
Skip 
pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot
URL: https://github.com/apache/spark/pull/49859





Re: [PR] [SPARK-51138][PYTHON][CONNECT][TESTS] Skip pyspark.sql.tests.connect.test_parity_frame_plot_plotly.FramePlotPlotlyParityTests.test_area_plot [spark]

2025-02-09 Thread via GitHub


HyukjinKwon commented on PR #49859:
URL: https://github.com/apache/spark/pull/49859#issuecomment-2646721135

   Merged to master and branch-4.0.





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


zhengruifeng commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646737700

   The JPMML upgrade and related code changes LGTM.
   
   also ping @LuciferYang and @dongjoon-hyun





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on code in PR #49854:
URL: https://github.com/apache/spark/pull/49854#discussion_r1948060171


##
pom.xml:
##
@@ -599,7 +599,7 @@
   
 org.glassfish.jaxb
 jaxb-runtime
-2.3.2
+4.0.5

Review Comment:
   There is currently no mutual dependence between the `org.glassfish.jersey` 
series (3.0.16) and the `org.glassfish.jaxb` (4.0.5) series.
   
   But they both depend on `jakarta.xml.bind:jakarta.xml.bind-api` and 
`jakarta.activation:jakarta.activation-api`, and the dependency versions of 
`jaxb` are newer (xml: 4.0.2 vs 3.0.1, activation: 2.1.3 vs 2.0.1).






[PR] [SPARK-51135][SQL] Fix ViewResolverSuite for ANSI modes [spark]

2025-02-09 Thread via GitHub


vladimirg-db opened a new pull request, #49857:
URL: https://github.com/apache/spark/pull/49857

   ### What changes were proposed in this pull request?
   
   Fix `ViewResolverSuite` for non-ANSI mode. View column `Cast`s have 
ANSI evaluation in non-ANSI mode when `COMPENSATION` is the default 
view schema mode: 
https://github.com/apache/spark/blob/301b666a1fcbd4c59d96c53fe3a547ea1512f397/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L965.
   
   Also, use `dsl` package to simplify test code.
   
   ### Why are the changes needed?
   
   To fix tests in non-ANSI mode: 
https://github.com/apache/spark/pull/49658#discussion_r1944266007
   
   ### Does this PR introduce _any_ user-facing change?
   
   Just a test suite change.
   
   ### How was this patch tested?
   
   Fixed the `ViewResolverSuite`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   copilot.nvim.





Re: [PR] [SPARK-50982][SQL] Support more SQL/DataFrame read path functionality in single-pass Analyzer [spark]

2025-02-09 Thread via GitHub


vladimirg-db commented on code in PR #49658:
URL: https://github.com/apache/spark/pull/49658#discussion_r1948128138


##
sql/core/src/test/scala/org/apache/spark/sql/analysis/resolver/ViewResolverSuite.scala:
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.analysis.resolver
+
+import org.apache.spark.sql.{AnalysisException, QueryTest}
+import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.analysis.resolver.{MetadataResolver, 
Resolver}
+import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, 
Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{
+  LocalRelation,
+  LogicalPlan,
+  OneRowRelation,
+  Project,
+  SubqueryAlias,
+  View
+}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, StringType}
+
+class ViewResolverSuite extends QueryTest with SharedSparkSession {
+  private val catalogName =
+"spark_catalog"
+  private val col1Integer =
+AttributeReference(name = "col1", dataType = IntegerType, nullable = 
false)()
+  private val col2String =
+AttributeReference(name = "col2", dataType = StringType, nullable = 
false)()
+
+  test("Temporary view") {
+withView("temporary_view") {
+  spark.sql("CREATE TEMPORARY VIEW temporary_view AS SELECT col1, col2 
FROM VALUES (1, 'a');")
+
+  checkViewResolution(
+"SELECT * FROM temporary_view",
+expectedChild = Project(
+  projectList = Seq(
+Alias(Cast(col1Integer, 
IntegerType).withTimeZone(conf.sessionLocalTimeZone), "col1")(),
+Alias(Cast(col2String, 
StringType).withTimeZone(conf.sessionLocalTimeZone), "col2")()
+  ),
+  child = Project(
+projectList = Seq(col1Integer, col2String),
+child = LocalRelation(
+  output = Seq(col1Integer, col2String),
+  data = Seq(
+InternalRow.fromSeq(Seq(1, 
"a").map(CatalystTypeConverters.convertToCatalyst))
+  )
+)
+  )
+)
+  )
+}
+  }
+
+  test("Persistent view") {

Review Comment:
   Created a PR: https://github.com/apache/spark/pull/49857






Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on PR #49854:
URL: https://github.com/apache/spark/pull/49854#issuecomment-2646172436

   @vruusmann  Thank you for sharing these views, much appreciated.





Re: [PR] [SPARK-51132][ML][BUILD] Upgrade `JPMML` to 1.7.1 [spark]

2025-02-09 Thread via GitHub


wayneguow commented on code in PR #49854:
URL: https://github.com/apache/spark/pull/49854#discussion_r1948060923


##
pom.xml:
##
@@ -615,16 +615,17 @@
 org.jvnet.staxex
 stax-ex
   
-  
   
-jakarta.activation

Review Comment:
   Hmm, I found that the `jakarta.xml.bind:jakarta.xml.bind-api` exclusion in 
`jersey-server` can be removed, but the `com.sun.activation:jakarta.activation` 
exclusion in `jersey-common` should still be retained, because this dependency 
is currently unused.


