-1

SPARK-51029 (GitHub PR [1]) removes `hive-llap-common` from the Spark binary
distribution, which technically breaks the documented feature "Spark SQL supports
integration of Hive UDFs, UDAFs and UDTFs" [2]. More precisely, it changes Hive UDF
support from batteries-included to requiring an extra jar.

In detail: when a user runs a query like CREATE TEMPORARY FUNCTION hello AS
'my.HelloUDF', it triggers o.a.h.hive.ql.exec.FunctionRegistry initialization,
which also registers the Hive built-in UDFs, UDAFs and UDTFs [3]. A
NoClassDefFoundError then occurs because some of those built-in UDTFs depend on
classes in hive-llap-common.


org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
        at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
        at java.base/java.lang.Class.getConstructor0(Class.java:3578)
        at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
        at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
        at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
        at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
        at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
        at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
        at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
        at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
        at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
    …

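For context, a minimal reproduction sketch against the RC2 binary distribution
(Scala, run in spark-shell with Hive support; my.HelloUDF is only a placeholder
for any user-provided Hive UDF that has no LLAP dependency):

    // Register the user's Hive UDF; the UDF class itself does not reference LLAP.
    spark.sql("CREATE TEMPORARY FUNCTION hello AS 'my.HelloUDF'")
    // Resolving the function builds the Hive expression, which initializes
    // o.a.h.hive.ql.exec.FunctionRegistry and registers Hive's built-in UDTFs,
    // failing with NoClassDefFoundError:
    //   org/apache/hadoop/hive/llap/security/LlapSigner$Signable
    spark.sql("SELECT hello('world')").show()
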
Currently (v4.0.0-rc2), the user must add the hive-llap-common jar explicitly, e.g.
via --packages org.apache.hive:hive-llap-common:2.3.10, to work around the
NoClassDefFoundError, even though my.HelloUDF does not depend on any class in
hive-llap-common. This is quite confusing.
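
For completeness, a rough workaround sketch when the session is created
programmatically rather than via spark-shell/spark-submit (assumptions: client
mode, and spark.jars.packages is set before the session starts):

    // Workaround sketch only; roughly mirrors passing --packages on the command line.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.jars.packages", "org.apache.hive:hive-llap-common:2.3.10")
      .enableHiveSupport()
      .getOrCreate()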

[1] https://github.com/apache/spark/pull/49725
[2] https://spark.apache.org/docs/3.5.5/sql-ref-functions-udf-hive.html
[3] 
https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L208

Thanks,
Cheng Pan



> On Mar 7, 2025, at 13:15, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> RC2 fails and I'll cut RC3 next week. Thanks for the feedback!
> 
> On Thu, Mar 6, 2025 at 6:44 AM Chris Nauroth <cnaur...@apache.org 
> <mailto:cnaur...@apache.org>> wrote:
>> Here is one more problem I found during RC2 verification:
>> 
>> https://github.com/apache/spark/pull/50173
>> 
>> This one is just a test issue.
>> 
>> Chris Nauroth
>> 
>> 
>> On Tue, Mar 4, 2025 at 2:55 PM Jules Damji <jules.da...@gmail.com 
>> <mailto:jules.da...@gmail.com>> wrote:
>>> - 1 (non-binding)
>>> 
>>> I ran into a number of installation and launching problems. Maybe it's my 
>>> environment, even though I removed any old binaries and packages.
>>> 
>>> 1. Pip installing pyspark 4.0.0 and pyspark-connect 4.0 from the .tar.gz file 
>>> worked, but launching pyspark results in:
>>> 
>>> 25/03/04 14:00:26 ERROR SparkContext: Error initializing SparkContext.
>>> java.lang.ClassNotFoundException: 
>>> org.apache.spark.sql.connect.SparkConnectPlugin
>>> 
>>> 2. Similarly, installing the tarballs of either distribution and launching 
>>> spark-shell goes into a loop that is terminated by the shutdown hook.
>>> 
>>> Thank you Wenchen for leading these onerous release manager efforts; hopefully 
>>> we will soon be able to install and launch seamlessly.
>>> 
>>> Keep up the good work & tireless effort for the Spark community!
>>> 
>>> cheers
>>> Jules
>>> 
>>> WARNING: Using incubator modules: jdk.incubator.vector
>>> 25/03/04 14:49:35 INFO BaseAllocator: Debug mode disabled. Enable with the 
>>> VM option -Darrow.memory.debug.allocator=true.
>>> 25/03/04 14:49:35 INFO DefaultAllocationManagerOption: allocation manager 
>>> type not specified, using netty as the default type
>>> 25/03/04 14:49:35 INFO CheckAllocator: Using DefaultAllocationManager at 
>>> memory/netty/DefaultAllocationManagerFactory.class
>>> Using Spark's default log4j profile: 
>>> org/apache/spark/log4j2-defaults.properties
>>> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC 
>>> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
>>> exception, retrying (wait=50 ms, currentRetryNum=1, policy=DefaultPolicy).
>>> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC 
>>> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
>>> exception, retrying (wait=200 ms, currentRetryNum=2, policy=DefaultPolicy).
>>> 25/03/04 14:49:35 WARN GrpcRetryHandler: Non-Fatal error during RPC 
>>> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
>>> exception, retrying (wait=800 ms, currentRetryNum=3, policy=DefaultPolicy).
>>> 25/03/04 14:49:36 WARN GrpcRetryHandler: Non-Fatal error during RPC 
>>> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
>>> exception, retrying (wait=3275 ms, currentRetryNum=4, policy=DefaultPolicy).
>>> 25/03/04 14:49:39 WARN GrpcRetryHandler: Non-Fatal error during RPC 
>>> execution: org.sparkproject.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
>>> exception, retrying (wait=12995 ms, currentRetryNum=5, 
>>> policy=DefaultPolicy).
>>> ^C25/03/04 14:49:40 INFO ShutdownHookManager: Shutdown hook called
>>> 
>>> 
>>> 
>>>> On Mar 4, 2025, at 2:24 PM, Chris Nauroth <cnaur...@apache.org 
>>>> <mailto:cnaur...@apache.org>> wrote:
>>>> 
>>>> -1 (non-binding)
>>>> 
>>>> I think I found some missing license information in the binary 
>>>> distribution. We may want to include this in the next RC:
>>>> 
>>>> https://github.com/apache/spark/pull/50158
>>>> 
>>>> Thank you for putting together this RC, Wenchen.
>>>> 
>>>> Chris Nauroth
>>>> 
>>>> 
>>>> On Mon, Mar 3, 2025 at 6:10 AM Wenchen Fan <cloud0...@gmail.com 
>>>> <mailto:cloud0...@gmail.com>> wrote:
>>>>> Thanks for bringing up these blockers! I know RC2 isn’t fully ready yet, 
>>>>> but with over 70 commits since RC1, it’s time to have a new RC so people 
>>>>> can start testing the latest changes. Please continue testing and keep 
>>>>> the feedback coming!
>>>>> 
>>>>> On Mon, Mar 3, 2025 at 6:06 PM beliefer <belie...@163.com 
>>>>> <mailto:belie...@163.com>> wrote:
>>>>>> -1 
>>>>>> https://github.com/apache/spark/pull/50112 should be merged before 
>>>>>> release.
>>>>>> 
>>>>>> 
>>>>>> At 2025-03-01 15:25:06, "Wenchen Fan" <cloud0...@gmail.com 
>>>>>> <mailto:cloud0...@gmail.com>> wrote:
>>>>>> 
>>>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>>>> 4.0.0.
>>>>>> 
>>>>>> The vote is open until March 5 (PST) and passes if a majority +1 PMC 
>>>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>>> 
>>>>>> [ ] +1 Release this package as Apache Spark 4.0.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>> 
>>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>> 
>>>>>> The tag to be voted on is v4.0.0-rc2 (commit 
>>>>>> 85188c07519ea809012db24421714bb75b45ab1b)
>>>>>> https://github.com/apache/spark/tree/v4.0.0-rc2
>>>>>> 
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-bin/
>>>>>> 
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>> 
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1478/
>>>>>> 
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc2-docs/
>>>>>> 
>>>>>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>>>> 
>>>>>> This release is using the release script of the tag v4.0.0-rc2.
>>>>>> 
>>>>>> FAQ
>>>>>> 
>>>>>> =========================
>>>>>> How can I help test this release?
>>>>>> =========================
>>>>>> 
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running it on this release candidate, then
>>>>>> reporting any regressions.
>>>>>> 
>>>>>> If you're working in PySpark you can set up a virtual env, install
>>>>>> the current RC, and see if anything important breaks. In Java/Scala,
>>>>>> you can add the staging repository to your project's resolvers and test
>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>> you don't end up building with an out-of-date RC going forward).
>>> 
