Hi Dian,

Thanks a lot for the explanation and help. Option 2) is what I needed and
it works.

Regards
Eduard

On Tue, May 18, 2021 at 6:21 AM Dian Fu <dian0511...@gmail.com> wrote:

> Hi,
>
> 1) The cause of the exception:
> The dependencies added via pipeline.jars / pipeline.classpaths are used
> to construct the user classloader.
>
> For your job, the exception happens when
> HadoopUtils.getHadoopConfiguration is called. The reason is that
> HadoopUtils is provided by Flink and is therefore loaded by the system
> classloader instead of the user classloader. Since the Hadoop dependencies
> are only available in the user classloader, a ClassNotFoundException is
> raised.
>
> So, the Hadoop dependencies are a little different from the other
> dependencies: Flink itself depends on them, not only the user code.
>
> 2) How to support HADOOP_CLASSPATH in local mode (e.g. executing in an IDE)
>
> Currently, you have to copy the Hadoop jars into the PyFlink installation
> directory: site-packages/pyflink/lib/
> However, it makes sense to support the HADOOP_CLASSPATH environment
> variable in local mode to avoid copying the jars manually. I will create a
> ticket for this.
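>
> A minimal sketch of that copy step in Python (the source path for the
> Hadoop jars is an assumption -- adjust it to your environment):
>
>     import glob
>     import os
>     import shutil
>
>     import pyflink
>
>     # Destination: the lib/ directory of the PyFlink installation
>     pyflink_lib = os.path.join(os.path.dirname(pyflink.__file__), "lib")
>
>     # Source: wherever your Hadoop jars live (assumed location)
>     for jar in glob.glob("/opt/hadoop/share/hadoop/**/*.jar", recursive=True):
>         shutil.copy(jar, pyflink_lib)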
>
> 3) How to support HADOOP_CLASSPATH in remote mode (e.g. submitted via
> `flink run`)
> You could export HADOOP_CLASSPATH=`hadoop classpath` manually before
> executing `flink run`.
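>
> For example (the job file name is just a placeholder):
>
>     export HADOOP_CLASSPATH=`hadoop classpath`
>     flink run -py my_job.py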
>
> Regards,
> Dian
>
> On May 17, 2021, at 8:45 PM, Eduard Tudenhoefner <edu...@dremio.com> wrote:
>
> Hello,
>
> I was wondering whether anyone has tried and/or had any luck creating a
> custom catalog with Iceberg + Flink via the Python API (
> https://iceberg.apache.org/flink/#custom-catalog)?
>
> When doing so, the docs mention that dependencies need to be specified via
> *pipeline.jars* / *pipeline.classpaths* (
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/python/dependency_management/).
> I'm providing the Iceberg Flink runtime + the Hadoop libs via those, and I
> can confirm that I'm landing in the *FlinkCatalogFactory*, but then things
> fail because it doesn't see the *Hadoop dependencies* for some reason.
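>
> For reference, this is roughly how I'm declaring the dependencies (the
> jar paths are placeholders for my local copies):
>
>     from pyflink.table import EnvironmentSettings, TableEnvironment
>
>     env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
>     t_env = TableEnvironment.create(env_settings)
>
>     # Jars shipped with the job; paths are placeholders
>     t_env.get_config().get_configuration().set_string(
>         "pipeline.jars",
>         "file:///path/to/iceberg-flink-runtime.jar")
>     t_env.get_config().get_configuration().set_string(
>         "pipeline.classpaths",
>         "file:///path/to/hadoop-common.jar;file:///path/to/hadoop-hdfs.jar")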
>
> What would be the right way to provide the *HADOOP_CLASSPATH* when using
> the Python API? I have a minimal code example that shows the issue here:
> https://gist.github.com/nastra/92bc3bc7b7037d956aa5807988078b8d#file-flink-py-L38
>
> Thanks
>
