Thanks! Seems I wasn't too far off then. It's my understanding that because
we're using EMRFS consistent view, we should not use S3FileIO or the EMRFS
metadata will get out of sync, but this catalog doesn't seem to work with
HadoopFileIO so far in my basic testing. I get a NullPointerException because
the Hadoop configuration isn't passed along at some point.
I noticed that I needed to call `setConf()` to get the Hadoop configs into the
catalog object.
Map<String, String> props = ImmutableMap.of(
    "type", "iceberg",
    "warehouse", config.getOutputDir(),
    "lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager",
    "lock.table", config.getDynamoIcebergLocksTable(),
    "io-impl", "org.apache.iceberg.hadoop.HadoopFileIO"
);
this.icebergCatalog.initialize("iceberg", props);
this.icebergCatalog.setConf(spark.sparkContext().hadoopConfiguration());
Then when I call createTable later, I get:
java.lang.NullPointerException
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:481)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:48)
    at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
    at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:64)
    at org.apache.iceberg.BaseMetastoreTableOperations.writeNewMetadata(BaseMetastoreTableOperations.java:137)
    at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:105)
    at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:118)
    at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:215)
    at org.apache.iceberg.BaseMetastoreCatalog.createTable(BaseMetastoreCatalog.java:48)
    at org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:105)
The NPE is because `conf` is null inside that method, but I verified that
`icebergCatalog.hadoopConf` is the expected object.
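If it helps, here is a tiny self-contained sketch of what I suspect is happening. This is not Iceberg's actual code, just my guess from the stack trace: if the catalog constructs its FileIO during `initialize()`, then a `setConf()` made afterwards (as in my snippet above) updates the catalog's own field but never reaches the already-built FileIO.

```java
// Toy illustration of the suspected ordering pitfall (not Iceberg code).
import java.util.Objects;

class ToyFileIO {
    final Object conf; // stands in for the Hadoop Configuration

    ToyFileIO(Object conf) { this.conf = conf; }

    String newOutputFile(String path) {
        // Analogous to the NPE site in FileSystem.get(): conf must be non-null.
        Objects.requireNonNull(conf, "conf is null");
        return "output:" + path;
    }
}

class ToyCatalog {
    private Object conf;      // set via setConf()
    private ToyFileIO fileIO; // created once, during initialize()

    void setConf(Object conf) { this.conf = conf; }

    void initialize() {
        // Snapshots whatever conf is set *right now* into the FileIO.
        this.fileIO = new ToyFileIO(this.conf);
    }

    ToyFileIO io() { return fileIO; }
}

public class OrderingDemo {
    public static void main(String[] args) {
        // Order from my snippet: initialize() first, setConf() second.
        ToyCatalog bad = new ToyCatalog();
        bad.initialize();
        bad.setConf(new Object()); // too late: fileIO already has null conf
        try {
            bad.io().newOutputFile("s3://bucket/t");
        } catch (NullPointerException e) {
            System.out.println("NPE: " + e.getMessage());
        }

        // Reversed order: the FileIO sees the conf.
        ToyCatalog good = new ToyCatalog();
        good.setConf(new Object());
        good.initialize();
        System.out.println(good.io().newOutputFile("s3://bucket/t"));
    }
}
```

If that is what's happening, calling `setConf()` before `initialize()` might avoid the NPE, but I haven't verified that yet.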
Should the GlueCatalog be expected to work with HadoopFileIO, or is it only
compatible with S3FileIO?
Greg
From: Jack Ye <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 7, 2021 at 4:16 PM
To: Iceberg Dev List <[email protected]>
Subject: Re: GlueCatalog example?
Yeah, this is actually a good point; the documentation is mostly about loading
the catalog into different SQL engines and lacks Java API examples. The
integration tests are a good place to see Java examples:
https://github.com/apache/iceberg/blob/master/aws/src/integration/java/org/apache/iceberg/aws/glue/GlueTestBase.java
-Jack Ye
On Wed, Jul 7, 2021 at 1:27 PM Greg Hill <[email protected]> wrote:
Is there a Java example for the proper way to get the GlueCatalog object? We
are trying to convert from HadoopTables and need access to the lower-level APIs
to create and update tables with partitions.
I’m looking for something similar to these examples for HadoopTables and
HiveCatalog:
https://iceberg.apache.org/java-api-quickstart/
From what I can gather looking at the code, this is what I came up with (our
catalog name is `iceberg`), but it feels like there’s probably a better way
that I’m not seeing:
this.icebergCatalog = new GlueCatalog();
Configuration conf = spark.sparkContext().hadoopConfiguration();
Map<String, String> props = ImmutableMap.of(
    "type", conf.get("spark.sql.catalog.iceberg.type"),
    "warehouse", conf.get("spark.sql.catalog.iceberg.warehouse"),
    "lock-impl", conf.get("spark.sql.catalog.iceberg.lock-impl"),
    "lock.table", conf.get("spark.sql.catalog.iceberg.lock.table"),
    "io-impl", conf.get("spark.sql.catalog.iceberg.io-impl")
);
this.icebergCatalog.initialize("iceberg", props);
Sorry for the potentially n00b question, but I’m a n00b 😃
Greg