pan3793 commented on PR #46521: URL: https://github.com/apache/spark/pull/46521#issuecomment-2707020792
> Are you assuming to rebuild all Hive UDF jars here?

@dongjoon-hyun I never made such an assumption. Most existing UDFs should work without any change, except when the UDF explicitly imports and uses classes we removed from new Spark releases. This is not limited to CodeHaus Jackson; the risk arises each time we update `dev/deps/spark-deps-hadoop-3-hive-2.3`. Say a CustomUDF built with Hive 2.3.9 uses OkHTTP classes: it works well in Spark 3.5 because Spark ships the OkHTTP jar via K8s client 6, but Spark 4.0 removed the OkHTTP jars during the K8s client 7 upgrade, so the CustomUDF fails with an OkHTTP class-not-found error. To fix it, the user can either shade the deps or add them via `--packages` (in this case, no rebuild is required because the Hive UDF interface is binary compatible; see the sketch at the end of this comment), so it's the user's responsibility to handle the UDF's transitive deps.

What matters is that we must NOT break the Hive built-in UDF deps; otherwise, it blocks `o.a.h.hive.ql.exec.FunctionRegistry` initialization and breaks the whole Hive UDF feature. That's why I argue that SPARK-51029 should be reverted.

SPARK-51029 (GitHub PR [1]) removes `hive-llap-common` from the Spark binary distributions, which technically breaks the feature "Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs" [2]; more precisely, it changes Hive UDF support from batteries-included to batteries-not-included. In detail, when a user runs a query like `CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF'`, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the Hive built-in UDFs, UDAFs and UDTFs [3]; then a NoClassDefFoundError occurs because some built-in UDTFs depend on a class in `hive-llap-common`.

```
org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
  at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
  at java.base/java.lang.Class.getConstructor0(Class.java:3578)
  at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
  at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
  at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
  at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
  at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
  ...
```
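For reference, a `my.HelloUDF` like the one named in the query could be as trivial as the following sketch (the class is hypothetical, shown only to illustrate that such a UDF needs nothing beyond the Hive UDF interface itself):

```java
package my;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A minimal Hive UDF: it depends only on the Hive UDF interface and
// Hadoop's Text type, and imports nothing from hive-llap-common.
public class HelloUDF extends UDF {
  public Text evaluate(Text name) {
    if (name == null) {
      return null;
    }
    return new Text("Hello, " + name);
  }
}
```

Even so, `CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF'` hits the NoClassDefFoundError above, because the error comes from `FunctionRegistry` initializing the Hive built-ins, not from the UDF itself.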
Currently (v4.0.0-rc2), the user must add the `hive-llap-common` jar explicitly, e.g. by using `--packages org.apache.hive:hive-llap-common:2.3.10`, to fix the NoClassDefFoundError, even though `my.HelloUDF` does not depend on any class in `hive-llap-common`; this is quite confusing.

[1] https://github.com/apache/spark/pull/49725
[2] https://spark.apache.org/docs/3.5.5/sql-ref-functions-udf-hive.html
[3] https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L208
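To make the transitive-dependency scenario concrete, the hypothetical CustomUDF mentioned earlier could look like the following sketch (the class name and the OkHTTP usage are illustrative, not taken from any real UDF):

```java
package my;

import java.io.IOException;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF compiled against Hive 2.3.9. Because the Hive UDF
// interface is binary compatible, the jar itself keeps working across Spark
// releases, but okhttp3.* must be on the classpath at runtime: present in
// Spark 3.5 (shipped via K8s client 6), absent in Spark 4.0, so loading this
// class there fails with an OkHTTP NoClassDefFoundError.
public class CustomUDF extends UDF {
  private final OkHttpClient client = new OkHttpClient();

  public Text evaluate(Text url) {
    if (url == null) {
      return null;
    }
    Request request = new Request.Builder().url(url.toString()).build();
    try (Response response = client.newCall(request).execute()) {
      return new Text(response.body().string());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
```

No rebuild is needed to fix it: shading OkHTTP into the UDF jar or adding it via `--packages` both restore the missing classes without touching the UDF source.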