Hi all,

After upgrading one of our Flink jobs from 1.18 to 1.20 we started to see a
classloading issue when using the file source with the Parquet Avro format,
which looks like a regression:

java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at
org.apache.flink.formats.parquet.avro.AvroParquetRecordFormat.createReader(AvroParquetRecordFormat.java:86)
    at
org.apache.flink.connector.file.src.impl.StreamFormatAdapter.lambda$createReader$0(StreamFormatAdapter.java:77)

    ...
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown
Source)


Further digging showed that this issue was caused by the changes to
AvroParquetRecordFormat from FLINK-35015 [1][2]: even though the class
mentioned in the exception is present in the child classloader, the
exception is thrown when HadoopUtils.getHadoopConfiguration is accessed
during creation of the reader.
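
For context, the source is built roughly like this (schema and path are
illustrative; the API names come from flink-parquet and
flink-connector-files):

```java
// Sketch of the failing setup. The schema and path are placeholders;
// assumes flink-parquet and flink-connector-files are on the classpath.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;

Schema schema = new Schema.Parser().parse(schemaJson); // illustrative

FileSource<GenericRecord> source =
    FileSource.forRecordStreamFormat(
            AvroParquetReaders.forGenericRecord(schema),
            new Path("s3://bucket/data/")) // illustrative path
        .build();
// The NoClassDefFoundError surfaces later, when the format creates its
// reader and touches HadoopUtils.getHadoopConfiguration.
```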
One way around this is to include a Hadoop distribution in the image, as
mentioned in the docs [3]; however, this significantly increases the image
size compared to shipping the necessary dependencies in the application jar.
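
For completeness, the docs-recommended setup [3] wires an existing Hadoop
distribution into Flink's classpath via the environment, e.g.:

```shell
# Make Hadoop classes visible to Flink's parent classloader.
# Requires a Hadoop distribution to be present in the image,
# which is exactly the image-size cost described above.
export HADOOP_CLASSPATH=$(hadoop classpath)
```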

1 - https://issues.apache.org/jira/browse/FLINK-35015
2 -
https://github.com/apache/flink/blob/release-1.20/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/AvroParquetRecordFormat.java#L86
3 -
https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/configuration/advanced/#hadoop-dependencies
