pan3793 commented on PR #46521:
URL: https://github.com/apache/spark/pull/46521#issuecomment-2707020792

   > Are you assuming to rebuild all Hive UDF jars here?
   
   @dongjoon-hyun I never made such an assumption. Most of the existing UDFs 
should work without any change, except for UDFs that explicitly import and use 
classes we removed from new Spark releases. This is not limited to CodeHaus 
Jackson; the risk arises each time we update 
`dev/deps/spark-deps-hadoop-3-hive-2.3`. 
   
   Let's say a CustomUDF built with Hive 2.3.9 uses OkHttp classes. It 
works well in Spark 3.5, which ships the OkHttp jars as transitive deps of K8s 
client 6, but Spark 4.0 removed the OkHttp jars during the K8s client 7 
upgrade, so the CustomUDF fails with an OkHttp class-not-found error. To fix 
it, the user can either shade the deps or add them via `--packages` (in this 
case, no rebuild is required because the Hive UDF interface is binary 
compatible), so it is the user's responsibility to handle the UDF's transitive 
deps.
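   
   For illustration, a minimal sketch of such a UDF (the class name and body 
are hypothetical, not taken from any real workload): it is built against Hive 
2.3.9 and imports OkHttp directly, so it loads on Spark 3.5 only because the 
distribution happens to ship the OkHttp jars.
   
   ```java
   import okhttp3.OkHttpClient;
   import okhttp3.Request;
   import okhttp3.Response;
   import org.apache.hadoop.hive.ql.exec.UDF;
   
   // Hypothetical UDF that drags in OkHttp as a transitive dependency.
   public class CustomUDF extends UDF {
     private final OkHttpClient client = new OkHttpClient();
   
     // Hive resolves `evaluate` reflectively through the legacy UDF
     // interface, which is binary compatible across these Spark releases.
     public String evaluate(String url) {
       try (Response response =
           client.newCall(new Request.Builder().url(url).build()).execute()) {
         return response.body() == null ? null : response.body().string();
       } catch (Exception e) {
         return null;
       }
     }
   }
   ```
   
   On a Spark 4.0 distribution without the OkHttp jars, loading this class is 
expected to fail with `java.lang.NoClassDefFoundError: okhttp3/OkHttpClient` 
until the user shades OkHttp into the UDF jar or adds it via `--packages`.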
   
   What matters is that we must NOT break the Hive built-in UDFs' deps; 
otherwise, it blocks `o.a.h.hive.ql.exec.FunctionRegistry` initialization and 
breaks the whole Hive UDF feature. That's why I argue that SPARK-51029 should 
be reverted.
   
   SPARK-51029 (GitHub PR [1]) removes `hive-llap-common` from the Spark binary 
distributions, which technically breaks the feature "Spark SQL supports 
integration of Hive UDFs, UDAFs and UDTFs" [2]. More precisely, it changes 
Hive UDF support from batteries-included to not.
   
   In detail, when a user runs a query like `CREATE TEMPORARY FUNCTION hello AS 
'my.HelloUDF'`, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` 
initialization, which also registers the Hive built-in UDFs, UDAFs and 
UDTFs [3]; a NoClassDefFoundError then occurs because some built-in UDTFs 
depend on classes in `hive-llap-common`. 
   
   ```
   org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
        at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
        at java.base/java.lang.Class.getConstructor0(Class.java:3578)
        at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
        at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
        at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
        at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
        at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
        at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
        at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
        at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
        ...
   ```
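   
   The `hive-llap-common` dependency comes entirely from that static 
initializer, so it can be demonstrated without Spark at all. A hypothetical 
minimal repro (assuming `hive-exec` 2.3.10 is on the classpath and 
`hive-llap-common` is not; the class name is made up):
   
   ```java
   import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
   
   // Any reference to FunctionRegistry runs its static initializer, which
   // registers the Hive built-in UDTFs via Registry.registerGenericUDTF.
   public class FunctionRegistryRepro {
     public static void main(String[] args) {
       // Triggers FunctionRegistry.<clinit>; without hive-llap-common this
       // is expected to throw java.lang.NoClassDefFoundError:
       //   org/apache/hadoop/hive/llap/security/LlapSigner$Signable
       System.out.println(FunctionRegistry.getFunctionNames().size());
     }
   }
   ```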
   
   Currently (v4.0.0-rc2), the user must add the `hive-llap-common` jar 
explicitly, e.g. by using `--packages org.apache.hive:hive-llap-common:2.3.10`, 
to fix the NoClassDefFoundError, even though `my.HelloUDF` does not depend on 
any class in `hive-llap-common`. This is quite confusing.
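   
   A sketch of the current workaround (assuming the `my.HelloUDF` example 
above is packaged in a placeholder `my-udf.jar`):
   
   ```
   # Without --packages, the CREATE TEMPORARY FUNCTION call fails with the
   # NoClassDefFoundError above, because FunctionRegistry cannot initialize.
   bin/spark-sql \
     --packages org.apache.hive:hive-llap-common:2.3.10 \
     --jars my-udf.jar \
     -e "CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF'"
   ```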
   
   [1] https://github.com/apache/spark/pull/49725
   [2] https://spark.apache.org/docs/3.5.5/sql-ref-functions-udf-hive.html
   [3] https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L208
   
   

