Hi,

1) The cause of the exception:
The dependencies added via pipeline.jars / pipeline.classpaths are used to 
construct the user class loader.

For your job, the exception happens when HadoopUtils.getHadoopConfiguration is 
called. The reason is that HadoopUtils is provided by Flink itself and is 
therefore loaded by the system class loader instead of the user class loader. 
Since the Hadoop dependencies are only available in the user class loader, a 
ClassNotFoundException is raised.

So the Hadoop dependencies are a little special compared to the other 
dependencies: they are depended on by Flink itself, not only by the user code.
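
To illustrate, this is roughly how such dependencies are specified via 
pipeline.jars in PyFlink (a minimal sketch following the Flink 1.13 docs; the 
jar paths are placeholders):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = TableEnvironment.create(env_settings)

    # Jars listed here are only added to the user class loader. That is why
    # classes which Flink itself needs to load (such as the Hadoop classes
    # used by HadoopUtils) cannot be provided this way.
    t_env.get_config().get_configuration().set_string(
        "pipeline.jars",
        "file:///path/to/iceberg-flink-runtime.jar;file:///path/to/hadoop-deps.jar")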

2) How to support HADOOP_CLASSPATH in local mode (e.g. executing in an IDE)

Currently, you have to copy the Hadoop jars into the PyFlink installation 
directory: site-packages/pyflink/lib/
However, it makes sense to support the HADOOP_CLASSPATH environment variable in 
local mode to avoid copying the jars manually. I will create a ticket for this.
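
In the meantime, one way to locate that directory (a small sketch; it assumes 
PyFlink is installed in the Python environment you run the job with):

    import os
    import pyflink

    # Prints the lib/ directory of the PyFlink installation, i.e. the
    # directory into which the Hadoop jars need to be copied.
    print(os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), 'lib'))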

3) How to support HADOOP_CLASSPATH in remote mode (e.g. submitted via `flink 
run`)
You could export HADOOP_CLASSPATH=`hadoop classpath` manually before executing 
`flink run`.

Regards,
Dian

> On May 17, 2021, at 8:45 PM, Eduard Tudenhoefner <edu...@dremio.com> wrote:
> 
> Hello,
> 
> I was wondering whether anyone has tried and/or had any luck creating a 
> custom catalog with Iceberg + Flink via the Python API 
> (https://iceberg.apache.org/flink/#custom-catalog)? 
> 
> When doing so, the docs mention that dependencies need to be specified via 
> pipeline.jars / pipeline.classpaths 
> (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/python/dependency_management/). 
> I'm providing the iceberg flink runtime + the hadoop libs via those and I 
> can confirm that I'm landing in the FlinkCatalogFactory but then things fail 
> because it doesn't see the hadoop dependencies for some reason. 
> 
> What would be the right way to provide the HADOOP_CLASSPATH when using the 
> Python API? I have a minimal code example that shows the issue here: 
> https://gist.github.com/nastra/92bc3bc7b7037d956aa5807988078b8d#file-flink-py-L38
> 
> Thanks
