Hi Péter,

Thanks for pointing this out!
I was aware of the version difference between PyFlink and some of the JAR dependencies. I started out with PyFlink 1.16 and hit some errors when creating the Dockerfile that seemed to be fixed by upgrading to 1.18. Hence the result: the first Dockerfile that seemed to make everything work right off the bat. But as you mention, it is probably best practice to keep everything in sync, so I will definitely update it as long as nothing else breaks (JAR compatibilities are my kryptonite) :)

________________________________
From: Péter Váry <peter.vary.apa...@gmail.com>
Sent: Tuesday, April 16, 2024 9:56 PM
To: Robert Prat <robert.p...@urbiotica.com>
Cc: Oscar Perez via user <user@flink.apache.org>
Subject: Re: Pyflink w Nessie and Iceberg in S3 Jars

Is it intentional that you are using iceberg-flink-runtime-1.16-1.3.1.jar with 1.18.0 PyFlink? This might cause issues later. I would try to synchronize the Flink versions throughout all the dependencies.

On Tue, Apr 16, 2024, 11:23 Robert Prat <robert.p...@urbiotica.com> wrote:

I finally managed to make it work following the advice of Robin Moffat, who replied to the earlier email:

"There's a lot of permutations that you've described, so it's hard to take one reproducible test case here to try and identify the error :) It certainly looks JAR related. You could try adding hadoop-hdfs-client-3.3.4.jar to your Flink ./lib folder. The other route I would go is to look at a functioning Flink -> Iceberg/Nessie environment and work backwards from there. This looks like a good place to start: https://www.dremio.com/blog/using-flink-with-apache-iceberg-and-nessie/"

I managed to follow the Dremio blog and get it to work. I'm sharing my Dockerfile in case it might help anyone with a similar issue in the future:

    FROM flink:1.18.0-scala_2.12

    ARG DEBIAN_FRONTEND=noninteractive

    COPY packages.txt .
    RUN apt-get update && xargs -a packages.txt apt-get -y install && apt-get autoremove
    RUN mkdir -p /var/run/vsftpd/empty/

    ENV TZ=Europe/Madrid
    ENV AWS_REGION=eu-west-1
    RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

    COPY requirements.txt .
    RUN pip3 install --no-cache-dir -r requirements.txt

    ## RabbitMQ connector flink-sql-connector-rabbitmq-3.0.1-1.17.jar org/apache/flink/flink-sql-connector-rabbitmq/3.0.1-1.17
    RUN curl -L https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-rabbitmq/3.0.1-1.17/flink-sql-connector-rabbitmq-3.0.1-1.17.jar -o /opt/flink/lib/flink-sql-connector-rabbitmq-3.0.1-1.17.jar

    ## Iceberg Flink Library
    RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/1.3.1/iceberg-flink-runtime-1.16-1.3.1.jar -o /opt/flink/lib/iceberg-flink-runtime-1.16-1.3.1.jar

    ## Hive Flink Library
    RUN curl -L https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.9_2.12/1.16.1/flink-sql-connector-hive-2.3.9_2.12-1.16.1.jar -o /opt/flink/lib/flink-sql-connector-hive-2.3.9_2.12-1.16.1.jar

    ## Hadoop Common Classes
    RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/2.8.3/hadoop-common-2.8.3.jar -o /opt/flink/lib/hadoop-common-2.8.3.jar

    ## Hadoop AWS Classes
    RUN curl -L https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar -o /opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar

    ## AWS Bundled Classes
    RUN curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar -o /opt/flink/lib/bundle-2.20.18.jar
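As a side note on Péter's version-sync advice, here is a hypothetical sanity check (not from the thread) that could be run inside this image to spot mixed Flink versions among the JARs. It only inspects filenames, so jars that do not encode a Flink version in their name are skipped:

    # Hypothetical sanity check: compare the installed apache-flink version
    # against the Flink versions embedded in the JAR filenames under
    # /opt/flink/lib, to catch mixed 1.16/1.17/1.18 artifacts.
    import re
    from importlib.metadata import version
    from pathlib import Path

    FLINK_LIB = Path("/opt/flink/lib")  # default lib dir in the official images

    pyflink_version = version("apache-flink")            # e.g. "1.18.0"
    expected = ".".join(pyflink_version.split(".")[:2])  # e.g. "1.18"

    for jar in sorted(FLINK_LIB.glob("*.jar")):
        # connector jars usually encode the target Flink version, e.g. ...-1.16-1.3.1.jar
        match = re.search(r"1\.\d{2}", jar.name)
        if match and match.group(0) != expected:
            print(f"possible mismatch: {jar.name} targets Flink {match.group(0)}, "
                  f"but apache-flink=={pyflink_version} is installed")

Run against the lib folder built above, it would flag the 1.16 Iceberg runtime, the 1.17 RabbitMQ connector and the 1.16 Hive connector, which is exactly the drift being discussed.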
My requirements.txt looks like this:

    apache-flink-libraries==1.18.0
    apache-flink==1.18.0

And the packages.txt is:

    curl
    findutils
    python3-pip
    python3-requests
    python3-software-properties
    python-is-python3

I'm not sure I need all the JARs in the Dockerfile, but as they say: if it ain't broke, don't fix it.
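That said, one way to check which of those JARs are actually needed would be to probe the JVM classpath from PyFlink and see which classes still resolve after removing a JAR. This is a hypothetical sketch, not something from the thread; it uses PyFlink's py4j gateway, which starts a JVM with whatever is in FLINK_HOME/lib:

    # Hypothetical classpath probe: checks whether the classes the
    # Iceberg/Nessie catalog needs are visible to the JVM PyFlink starts.
    from pyflink.java_gateway import get_gateway

    gateway = get_gateway()
    for cls in (
        "org.apache.iceberg.nessie.NessieCatalog",
        "org.apache.iceberg.aws.s3.S3FileIO",
        "org.apache.hadoop.hdfs.HdfsConfiguration",
    ):
        try:
            gateway.jvm.java.lang.Class.forName(cls)  # raises if not on classpath
            print(f"OK      {cls}")
        except Exception:
            print(f"MISSING {cls}")

If a class prints MISSING after a JAR is removed, that JAR was one of the needed ones.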
________________________________
From: Robert Prat <robert.p...@urbiotica.com>
Sent: Friday, April 12, 2024 3:45 PM
To: user@flink.apache.org
Subject: Pyflink w Nessie and Iceberg in S3 Jars

Hi there,

For several days I have been trying to find the right configuration for my pipeline, which roughly consists of the following schema: RabbitMQ -> PyFlink -> Nessie/Iceberg/S3. For what I am going to explain I have tried both locally and through the official Flink Docker images. I have tried several different Flink versions, but for simplicity let's say I am using apache-flink==1.18.0.

So far I have been able to use the JAR in org/apache/iceberg/iceberg-flink-runtime-1.18 to connect to RabbitMQ and obtain the data from some streams, so I'd say the source side is working. After that I have been trying to find a way to send the data in those streams to Iceberg in S3 through the Nessie catalog, which is the one I have working. I have been using this pipeline with both Spark and Trino for some time now, so I know it is working. Now what I am "simply" trying to do is to use my already set up Nessie catalog through Flink.

I have tried to connect both directly through the sql-client.sh in the bin of the PyFlink dir and through Python as:

    table_env.execute_sql(f"""
        CREATE CATALOG nessie WITH (
            'type'='iceberg',
            'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog',
            'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
            'uri'='http://mynessieip:mynessieport/api/v1',
            'ref'='main',
            'warehouse'='s3a://mys3warehouse',
            's3-access-key-id'='{USER_AWS}',
            's3-secret-access-key'='{KEY_AWS}',
            's3-path-style-access'='true',
            'client-region'='eu-west-1')""")

The JARs I have included in my pyflink/lib dir (one of the many combinations I have tried, with no result; I also tried to add them with env.add_jars or --jarfile) are:

* hadoop-aws-3.4.0.jar
* iceberg-flink-runtime-1.18-1.5.0.jar
* hadoop-client-3.4.0.jar
* hadoop-common-3.4.0.jar
* hadoop-hdfs-3.4.0.jar

Right now I am getting the following error message:

    py4j.protocol.Py4JJavaError: An error occurred while calling o56.executeSql.
    : java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/HdfsConfiguration
    (...)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.HdfsConfiguration...

But I have gotten several different errors across all the JAR combinations I have tried. So my question is: does anybody know if my problem is JAR related, or if I am doing something else wrong? I would be immensely grateful if someone could guide me to the right steps to implement this pipeline.

Thanks :)
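For reference, here is a minimal sketch of the catalog setup described in this thread, assuming a version-aligned Flink 1.18 / Iceberg 1.5.0 stack. The JAR paths, Nessie endpoint, warehouse bucket and region are placeholders taken from the messages above, not a configuration the thread confirms to work:

    # Minimal, hypothetical PyFlink sketch: register a Nessie-backed Iceberg
    # catalog. JAR paths and versions are assumptions (Flink 1.18 + Iceberg
    # 1.5.0, kept in sync per Péter's advice); endpoint/bucket are placeholders.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Register the JARs explicitly instead of relying on whatever sits in ./lib.
    t_env.get_config().set(
        "pipeline.jars",
        ";".join([
            "file:///opt/flink/lib/iceberg-flink-runtime-1.18-1.5.0.jar",
            "file:///opt/flink/lib/bundle-2.20.18.jar",  # AWS SDK v2 bundle for S3FileIO
        ]),
    )

    t_env.execute_sql("""
        CREATE CATALOG nessie WITH (
            'type'='iceberg',
            'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog',
            'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
            'uri'='http://mynessieip:mynessieport/api/v1',
            'ref'='main',
            'warehouse'='s3a://mys3warehouse',
            's3-path-style-access'='true',
            'client-region'='eu-west-1')
    """)

    t_env.execute_sql("SHOW CATALOGS").print()

Since S3FileIO falls back to the AWS SDK default credential chain, the s3-access-key-id and s3-secret-access-key properties could also be supplied through the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables instead of being inlined in the DDL.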