Hi Ryan, I updated the Spark JIRA I opened with more information I found after taking a heap dump:
https://issues.apache.org/jira/browse/SPARK-46762

The class `org.apache.iceberg.Table` is loaded twice: once by ChildFirstURLClassLoader and once by MutableURLClassLoader. The issue doesn't happen with Spark 3.4 and Iceberg 1.3, as I mentioned in the ticket. Do you think it's still a Spark Connect issue? I noticed there are somewhat bigger migration changes in the Iceberg repo going from 1.3 to 1.4 in order to support Spark 3.5. Do you think something might have been missed there?
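For illustration, two loader instances over the same jar are enough to produce
exactly this kind of ClassCastException. A contrived, self-contained Scala
sketch (the jar path is hypothetical - any jar containing
org.apache.iceberg.Table would do):

import java.net.{URL, URLClassLoader}

// hypothetical local path to the Iceberg runtime jar
val jar = Array(new URL("file:/tmp/iceberg-spark-runtime-3.5_2.12-1.4.3.jar"))

// parent = null, so the two loaders are siblings and never delegate to each other
val loaderA = new URLClassLoader(jar, null)
val loaderB = new URLClassLoader(jar, null)

val tableA = loaderA.loadClass("org.apache.iceberg.Table")
val tableB = loaderB.loadClass("org.apache.iceberg.Table")

println(tableA == tableB)                // false: same name, different defining loaders
println(tableA.isAssignableFrom(tableB)) // false: casting an instance across them throws ClassCastException

The JVM keys a class on (name, defining loader), so the two Table classes are
unrelated types even though they come from the same bytes.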
Thanks
Nirav

On Thu, Jan 18, 2024 at 9:46 AM Nirav Patel <nira...@gmail.com> wrote:

> Classloading does seem like an issue, though only when using Spark Connect
> 3.5 with Iceberg >= 1.4.
>
> It's weird because, as I also mentioned in the previous email, after adding
> the Spark property (spark.executor.userClassPathFirst=true) both classes get
> loaded from the same classloader - org.apache.spark.util.ChildFirstURLClassLoader.
> Not sure why the error would still happen.
>
> java.lang.ClassCastException: class
> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
> class org.apache.iceberg.Table (org.apache.iceberg.spark.source.
> *SerializableTableWithSize* is in unnamed module of loader
> org.apache.spark.util.*ChildFirstURLClassLoader* @a41c33c;
> org.apache.iceberg.*Table* is in unnamed module of loader
> org.apache.spark.util.*ChildFirstURLClassLoader* @16f95afb)
>
>
> On Tue, Jan 16, 2024 at 12:53 PM Ryan Blue <b...@tabular.io> wrote:
>
>> It looks to me like the classloader is the problem. The "child first"
>> classloader is apparently loading `Table`, but Spark is loading
>> `SerializableTableWithSize` from the parent classloader. Because delegation
>> isn't happening properly, you're getting two incompatible classes from the
>> same classpath, depending on where a class was loaded for the first time.
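>>
>> For reference, the child-first strategy is roughly the following (an
>> illustrative sketch only, not Spark's actual ChildFirstURLClassLoader
>> implementation):
>>
>> import java.net.{URL, URLClassLoader}
>>
>> class ChildFirstLoader(urls: Array[URL], parent: ClassLoader)
>>     extends URLClassLoader(urls, parent) {
>>   override def loadClass(name: String, resolve: Boolean): Class[_] =
>>     Option(findLoadedClass(name)).getOrElse {
>>       try findClass(name)                // search this loader's URLs first
>>       catch {
>>         case _: ClassNotFoundException =>
>>           super.loadClass(name, resolve) // only then delegate to the parent
>>       }
>>     }
>> }
>>
>> If `Table` is found by the child but `SerializableTableWithSize` was first
>> defined by the parent, the parent's copy is linked against the parent's
>> `Table`, which the JVM treats as a different class from the child's
>> `Table` - hence the failed cast.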
>>
>> On Fri, Jan 12, 2024 at 5:30 PM Nirav Patel <nira...@gmail.com> wrote:
>>
>>> It seems to be happening on an executor of the SC server, as I see the
>>> error in the executor logs. We did verify that there is only one version
>>> of iceberg-spark-runtime present.
>>> We do include a custom catalog impl jar. Though it's a shaded jar, I don't
>>> see "org/apache/iceberg/Table" or any other Iceberg classes when I run
>>> "jar -tvf" on it.
>>>
>>> I see both jars in 3 Spark configs: spark.repl.local.jars,
>>> spark.yarn.dist.jars, and spark.yarn.secondary.jars.
>>>
>>> I suspected a classloading issue as well, since the initial error was
>>> pointing to it:
>>>
>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0
>>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>> java.lang.ClassCastException: class
>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
>>> class org.apache.iceberg.Table
>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed
>>> module of loader org.apache.spark.util.*MutableURLClassLoader*
>>> @6819e13c; org.apache.iceberg.Table is in unnamed module of loader
>>> org.apache.spark.util.*ChildFirstURLClassLoader* @15fb0c43)
>>>
>>> Although *ChildFirstURLClassLoader* is a child of MutableURLClassLoader,
>>> the error shouldn't be related to that. I still tried adding the Spark
>>> flag (--conf "spark.executor.userClassPathFirst=true") when starting the
>>> Spark Connect server. It seems both classes get loaded by the same
>>> ClassLoader class, but the error still happens:
>>>
>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0
>>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>> java.lang.ClassCastException: class
>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
>>> class org.apache.iceberg.Table
>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed
>>> module of loader org.apache.spark.util.*ChildFirstURLClassLoader*
>>> @a41c33c; org.apache.iceberg.Table is in unnamed module of loader
>>> org.apache.spark.util.*ChildFirstURLClassLoader* @16f95afb)
>>>
>>> I see "ClassLoader @ <some_id>" in the logs. Are those object ids? (It's
>>> been a while since I worked with Java.) I'm wondering if multiple
>>> instances of the same ClassLoader are being initialized by SC. Maybe
>>> running with -verbose:class or taking a heap dump would help to verify?
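>>>
>>> The kind of check I have in mind is something like this - a minimal
>>> sketch (assuming a job can be run on the same executors, e.g. from
>>> spark-shell on the SC server's cluster) that prints which loader instance
>>> defined each side of the failing cast:
>>>
>>> // tasks run on the executors; getClassLoader's default toString prints
>>> // the loader class plus a hash, e.g. ChildFirstURLClassLoader@15fb0c43
>>> sc.parallelize(1 to 4, 4).map { _ =>
>>>   val table = Class.forName("org.apache.iceberg.Table")
>>>   val swts  = Class.forName("org.apache.iceberg.spark.source.SerializableTableWithSize")
>>>   s"Table: ${table.getClassLoader}, SerializableTableWithSize: ${swts.getClassLoader}"
>>> }.distinct().collect().foreach(println)
>>>
>>> And -verbose:class (e.g. via spark.executor.extraJavaOptions) should make
>>> each executor JVM log every class it loads and from which jar.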
>>>
>>> On Fri, Jan 12, 2024 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I think it looks like a version mismatch, perhaps between the SC client
>>>> and the server, or between where planning occurs and the executors. The
>>>> error is that `SerializableTableWithSize` is not a subclass of `Table`,
>>>> but it definitely should be. That sort of problem is usually caused by
>>>> class loading issues. Can you double-check that you have only one Iceberg
>>>> runtime in the Environment tab of your Spark cluster?
>>>>
>>>> On Tue, Jan 9, 2024 at 4:57 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>
>>>>> PS - the issue doesn't happen if we don't use Spark Connect and instead
>>>>> just use spark-shell or pyspark, as the OP on GitHub noted as well.
>>>>> However, the stacktrace doesn't seem to point to any class from the
>>>>> spark-connect jar (org.apache.spark:spark-connect_2.12:3.5.0).
>>>>>
>>>>> On Tue, Jan 9, 2024 at 4:52 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> We are testing Spark Connect with Iceberg.
>>>>>> We tried Spark 3.5 with the Iceberg 1.4.x releases (all of the
>>>>>> iceberg-spark-runtime-3.5_2.12-1.4.x.jar builds).
>>>>>>
>>>>>> With all of the 1.4.x jars we hit the following issue when running
>>>>>> Iceberg queries from a SparkSession created using Spark Connect
>>>>>> (--remote "sc://remote-master-node"):
>>>>>>
>>>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be
>>>>>> cast to org.apache.iceberg.Table at
>>>>>> org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>>>>>> at
>>>>>> org.apache.iceberg.spark.source.BatchDataReader.<init>(BatchDataReader.java:50)
>>>>>> at
>>>>>> org.apache.iceberg.spark.source.SparkColumnarReaderFactory.createColumnarReader(SparkColumnarReaderFactory.java:52)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:79)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>>>>>> at
>>>>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>>>>>> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
>>>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>>>>>> Source) at
>>>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown
>>>>>> Source) at
>>>>>>
>>>>>> Someone else has reported this issue on GitHub as well:
>>>>>> https://github.com/apache/iceberg/issues/8978
>>>>>>
>>>>>> It's currently working with Spark 3.4 and Iceberg 1.3. However, it
>>>>>> would be nice to get it working with Spark 3.5 as well, since 3.5 has
>>>>>> many improvements in Spark Connect.
>>>>>>
>>>>>> Thanks
>>>>>> Nirav
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>