It seems to be happening on the executors of the SC server, as I see the error in the executor logs. We did verify that there is only one version of iceberg-spark-runtime present. We do include a custom catalog implementation jar. Though it's a shaded jar, I don't see "org/apache/iceberg/Table" or any other Iceberg classes when I run "jar -tvf" on it.
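For reference, this is roughly the check I ran (the jar name here stands in for our shaded catalog jar):

  # List the shaded jar's entries and look for bundled Iceberg classes;
  # no matches means the jar does not repackage org.apache.iceberg.* itself.
  jar -tvf custom-catalog-shaded.jar | grep 'org/apache/iceberg' \
    || echo "no iceberg classes bundled"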
I see both jars in three Spark configs: spark.repl.local.jars, spark.yarn.dist.jars and spark.yarn.secondary.jars.

I suspected a classloading issue as well, since the initial error was pointing to one:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @6819e13c; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @15fb0c43)

Although ChildFirstURLClassLoader is a child of MutableURLClassLoader, the error shouldn't be related to that. I still tried adding a Spark flag (--conf "spark.executor.userClassPathFirst=true") when starting the Spark Connect server. With that, both classes seem to get loaded by the same ClassLoader class, but the error still happens:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2): java.lang.ClassCastException: class org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @a41c33c; org.apache.iceberg.Table is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @16f95afb)

I see "ClassLoader @<some_id>" in the logs. Are those object IDs? (It's been a while since I worked with Java.) I'm wondering if multiple instances of the same ClassLoader are being initialized by SC. Maybe running with -verbose:class or taking a heap dump would help verify? (I've put the minimal repro and what I plan to try next at the bottom of this mail, below the quoted thread.)

On Fri, Jan 12, 2024 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:

> I think it looks like a version mismatch, perhaps between the SC client
> and the server or between where planning occurs and the executors. The
> error is that the `SerializableTableWithSize` is not a subclass of `Table`,
> but it definitely should be. That sort of problem is usually caused by
> class loading issues. Can you double-check that you have only one Iceberg
> runtime in the Environment tab of your Spark cluster?
>
> On Tue, Jan 9, 2024 at 4:57 PM Nirav Patel <nira...@gmail.com> wrote:
>
>> PS - the issue doesn't happen if we don't use spark-connect and instead
>> just use spark-shell or pyspark, as the OP on GitHub said as well.
>> However, the stacktrace doesn't seem to point to any class from the
>> spark-connect jar (org.apache.spark:spark-connect_2.12:3.5.0).
>>
>> On Tue, Jan 9, 2024 at 4:52 PM Nirav Patel <nira...@gmail.com> wrote:
>>
>>> Hi,
>>> We are testing spark-connect with Iceberg.
>>> We tried Spark 3.5 with the Iceberg 1.4.x versions (all of the
>>> iceberg-spark-runtime-3.5_2.12-1.4.x jars).
>>>
>>> With all of the 1.4.x jars we hit the following issue when running
>>> Iceberg queries from a SparkSession created using spark-connect
>>> (--remote "sc://remote-master-node"):
>>>
>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast
>>> to org.apache.iceberg.Table
>>>   at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>>>   at org.apache.iceberg.spark.source.BatchDataReader.<init>(BatchDataReader.java:50)
>>>   at org.apache.iceberg.spark.source.SparkColumnarReaderFactory.createColumnarReader(SparkColumnarReaderFactory.java:52)
>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:79)
>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>>>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
>>>   at ...
>>>
>>> Someone else has reported this issue on GitHub as well:
>>> https://github.com/apache/iceberg/issues/8978
>>>
>>> It currently works with Spark 3.4 and Iceberg 1.3. However, ideally it
>>> would be nice to get it working with Spark 3.5 as well, since 3.5 has
>>> many improvements in spark-connect.
>>>
>>> Thanks,
>>> Nirav
>>>
>>
>
> --
> Ryan Blue
> Tabular
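P.S. For anyone who wants to reproduce this, the minimal path looks like the following (host, catalog and table names are placeholders):

  # Start a client session against the Spark Connect server; any Iceberg
  # scan then fails with the ClassCastException on the first action.
  pyspark --remote "sc://remote-master-node"
  # inside the pyspark shell:
  #   spark.sql("SELECT * FROM my_catalog.db.my_table LIMIT 10").show()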
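And here is what I plan to try next for the class-loading question. The server script path, package versions and application id are assumptions based on our setup, so treat it as a sketch:

  # Restart the Spark Connect server with class-loading tracing enabled on
  # the executors.
  ./sbin/start-connect-server.sh \
    --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3 \
    --conf "spark.executor.extraJavaOptions=-verbose:class" \
    --conf "spark.executor.userClassPathFirst=true"

  # Then pull the executor logs from YARN and check which jar (and which
  # loader) each copy of org.apache.iceberg.Table came from.
  yarn logs -applicationId <application_id> | grep 'org.apache.iceberg.Table'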