Re: How to perform read/write operation for iceberg table present AWS Glue Catalog

Jack Ye Wed, 21 Jun 2023 13:04:54 -0700

- > When we use the Iceberg Spark runtime API to interact with Iceberg tables,
it requires a cluster to run the Spark job. But when we use the Athena Simba
JDBC driver to execute the same query, will it be distributed across
clusters? I.e., how is the query executed in Athena?


Athena can be viewed as a managed Trino/Presto engine. The cluster is
invisible to users, and every query can be conceptually viewed as running in
a separate, isolated cluster.
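To make the "invisible cluster" point concrete, here is a rough Python sketch of what a client sees when it talks to Athena through the AWS SDK instead of the JDBC driver: it submits SQL, gets back a query ID, and polls for a terminal state. `start_query_execution` and `get_query_execution` are the boto3 Athena client calls; the helper name `run_athena_query` and its parameters are made up for illustration, and the client is passed in so the function carries no AWS dependency of its own.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit SQL to Athena and block until it reaches a terminal state.

    `client` is expected to behave like boto3.client("athena").
    """
    # Submit the query; Athena decides where and how it runs.
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll by query ID; there is no cluster handle to inspect or manage.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

With boto3 installed and AWS credentials configured, a call might look like `run_athena_query(boto3.client("athena"), "SELECT * FROM my_table LIMIT 10", "my_database", "s3://my-bucket/athena-results/")`, where the table, database, and bucket names are placeholders.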

EMR Spark runs in cluster mode with EC2 machines provisioned in the user's
account. There are also managed products like EMR Serverless Spark, Glue
ETL Spark, and Athena Spark, which expose progressively less of the cluster
concept and are oriented toward different groups of users (warehouse admins,
ETL data engineers, data scientists).
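Whichever Spark offering is used, the Glue catalog is attached the same way: through Iceberg catalog properties set at engine start time, as noted earlier in this thread. A small sketch of those properties follows. The class names are the ones from the Iceberg AWS integration; the helper names, the catalog name, and the warehouse path are placeholders picked for illustration.

```python
def glue_catalog_spark_conf(catalog_name, warehouse):
    """Return Spark properties that register an Iceberg catalog backed by Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog name and points it at the Glue implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def as_spark_submit_args(conf):
    """Flatten the properties into `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder catalog name and S3 warehouse path.
conf = glue_catalog_spark_conf("my_catalog", "s3://my-bucket/my/prefix")
spark_submit_args = as_spark_submit_args(conf)
```

The same key/value pairs can be passed to `SparkSession.builder.config(...)` instead of `spark-submit`. Outside EMR, the Iceberg Spark runtime jar and the AWS v2 SDK dependencies still have to be on the classpath.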

- > Could you please also provide your inputs on how Athena is interacting
with Iceberg tables during Read and Write Operation?

In general, it follows the same workflow as most of the other engines.
Iceberg mostly works at the scan node level of the entire query plan. It
also uses table statistics information to affect query plan generation.

For reading, it leverages the Iceberg library to apply predicates and
dynamic filters from the engine to generate scan tasks, and then distributes
those tasks to workers.
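As a toy illustration of that read path (not Iceberg's or Athena's actual code), the sketch below prunes data files by comparing a range predicate against per-file min/max bounds and then hands the surviving tasks to workers round-robin. `DataFile`, `plan_scan_tasks`, and `distribute` are invented names; the real library also prunes at the manifest level using partition information before it looks at individual files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_value: int  # lower bound of the filtered column in this file
    max_value: int  # upper bound of the filtered column in this file


def plan_scan_tasks(files, lower, upper):
    """Keep only files whose [min, max] range can overlap lower <= col <= upper."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


def distribute(tasks, worker_count):
    """Assign tasks to workers round-robin."""
    assignments = [[] for _ in range(worker_count)]
    for index, task in enumerate(tasks):
        assignments[index % worker_count].append(task)
    return assignments


files = [
    DataFile("a.parquet", 0, 99),
    DataFile("b.parquet", 100, 199),
    DataFile("c.parquet", 200, 299),
    DataFile("d.parquet", 300, 399),
]
# Predicate: 150 <= col <= 320, so a.parquet is never opened.
tasks = plan_scan_tasks(files, lower=150, upper=320)
assignments = distribute(tasks, worker_count=2)
```

The point of the sketch is only that pruning happens before any data is read, so the number of tasks depends on the predicate and not just on table size.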

For writing, after files are produced by the engine, it commits the files
through the Iceberg library.
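The commit step can be pictured as an optimistic compare-and-swap on the table's current metadata pointer: the engine writes data files first, then tries to publish them in a single atomic step, and retries if another writer got there first. The sketch below is a toy in-memory model of that idea. `ToyCatalog`, `CommitConflict`, and `commit_append` are hypothetical names, not Iceberg classes.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Holds one pointer to the table's current metadata version."""

    def __init__(self):
        self.version = 0
        self.files = ()

    def swap(self, expected_version, new_files):
        # Compare-and-swap: succeed only if nobody committed in between.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.version += 1
        self.files = new_files


def commit_append(catalog, written_files, max_attempts=3):
    """Add already-written files to the table, retrying on conflict."""
    for _ in range(max_attempts):
        # Read the current state, then try to publish on top of it.
        base_version, base_files = catalog.version, catalog.files
        try:
            catalog.swap(base_version, base_files + tuple(written_files))
            return catalog.version
        except CommitConflict:
            continue  # re-read the latest state and try again
    raise CommitConflict("gave up after repeated conflicts")
```

Because the data files already exist before the swap, a failed attempt only costs a metadata retry, not a rewrite of the data.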
- > Is there any Java API which we could use to interact with Iceberg
tables without using Spark?

Yes, see https://iceberg.apache.org/docs/latest/java-api-quickstart/
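The quickstart above covers the JVM side. For a Spark-free path outside the JVM, the pyiceberg route mentioned earlier in this thread could look roughly like the sketch below. `load_catalog`, `load_table`, `scan`, and `to_arrow` are pyiceberg calls; the wrapper name `read_rows` and the table identifier in the usage note are placeholders. It assumes pyiceberg with Glue support and pyarrow are installed, and that AWS credentials and region come from the environment.

```python
def read_rows(table_identifier, row_filter, columns, limit=None):
    """Scan an Iceberg table registered in Glue and return a pyarrow Table."""
    # Imported lazily so this module loads even without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    # "type": "glue" selects the Glue catalog implementation.
    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table(table_identifier)

    # The filter and column selection are applied during scan planning.
    scan = table.scan(
        row_filter=row_filter,
        selected_fields=tuple(columns),
        limit=limit,
    )
    return scan.to_arrow()
```

A call such as `read_rows("my_database.my_table", "id > 100", ["id", "name"], limit=10)` runs in a single Python process, with no cluster involved. This sketch only covers reads; whether writes are available depends on the pyiceberg version in use.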
- > What will be the recommended way to perform read and write operations on
Iceberg tables (Spark or JDBC driver)? Is there any performance matrix
available on these comparisons?

There's not really a strict recommendation; it's mostly driven by use case
and user base. Typically we observe Trino-flavor engines running faster than
Spark, but Spark covers a wider range of use cases because of its DataFrame
API.

- > We are not able to access personal repo
<https://github.com/jackye1995/aws-sdk-java-v2-analytics-bundle>.
Please provide the access for it.

Updated visibility :)

On Wed, Jun 21, 2023 at 4:51 AM Shetty, Anush <[email protected]>
wrote:

>
>
> Hi Jack,
>
>
>
> Thank you for your quick response.
>
>
>
> We are also trying to understand the integration between Athena , AWS Glue
> Catalog and Iceberg tables
>
>
>
> Could you please help us to understand the following queries.
>
>
>
>    1. When we use Iceberg Spark runtime API to interact with Iceberg
>    tables , it requires cluster to run the spark job. But when we use Athena
>    simba JDBC driver to execute the same query, will it be getting distributed
>    across clusters. i.e How the query is getting executed in Athena?
>    2. Could you please also provide your inputs on how Athena is
>    interacting with Iceberg tables during Read and Write Operation?
>    3. Is there any JAVA API which we could use to interact with iceberg
>    tables without using spark.
>    4. What will be the recommended way to perform read and write
>    operation on Iceberg tables. (Spark or JDBC driver). Is there any
>    performance matrix available on these comparisons?
>    5. We are not able to access personal repo
>    <https://github.com/jackye1995/aws-sdk-java-v2-analytics-bundle>.
>    Please provide the access for it.
>
>
>
> Thanks and Regards
>
> Anush
>
> *From:* Jack Ye <[email protected]>
> *Sent:* Tuesday, June 20, 2023 10:55 PM
> *To:* [email protected]
> *Cc:* Dayal, Kumar Abhishek <[email protected]>; Shetty, Anush <
> [email protected]>
> *Subject:* Re: How to perform read/write operation for iceberg table
> present AWS Glue Catalog
>
>
>
> Hi,
>
>
>
> Glue catalog works just like any other Iceberg catalog, by configuring
> related Iceberg catalog properties at engine start time. Within the
> integrations provided by the Iceberg project, you can use Spark, Flink, or
> Hive by installing their respective runtime jars. Those jars are packaged
> with the iceberg-aws module that contains the GlueCatalog implementation.
> Platforms like EMR
> automatically package the AWS v2 SDK dependencies for you, but if you do
> not use EMR, you will need to include additional AWS v2 SDKs. You can pick
> individual SDK clients, or use the bundle
> <https://mvnrepository.com/artifact/software.amazon.awssdk/bundle>.
> I also have this personal repo
> <https://github.com/jackye1995/aws-sdk-java-v2-analytics-bundle>
> as an example for a smaller size bundle.
>
>
>
> For Trino/Presto-flavor engines, yes you can use Athena, and also any
> Trino offering in EMR through the Glue catalog connection:
> https://trino.io/docs/current/connector/iceberg.html#glue-catalog
>
>
>
> In addition, you can also use pyiceberg to connect to any Python engines
> and libraries: https://py.iceberg.apache.org/configuration/#glue-catalog
>
>
>
> Please let me know if you have any questions.
>
>
>
> Best,
>
> Jack Ye
>
> On Tue, Jun 20, 2023 at 9:02 AM Awasthi, Somesh <
> [email protected]> wrote:
>
> Hi Iceberg team,
>
>
>
> We want to know how to perform read/write operations on Iceberg tables
> present in the AWS Glue Catalog.
>
>
>
> As of now we know of two approaches to connect with an Iceberg table.
>
>
>
> *Option1 JDBC:* Connect with Athena JDBC Driver
>
>
>
> *Option2 Spark:*  Connect with Apache Iceberg Spark Runtime
>
>
>
> Could you please let me know whether there is any solution other than Spark
> and the JDBC API to connect with Iceberg tables for read/write operations?
>
>
>
> If there is an alternate way to perform read/write operations on Iceberg
> tables present in the AWS Glue Catalog, please provide detailed steps to
> implement it in a sample program.
>
>
>
> Thanks,
>
> Somesh
>
>
>
>
