Hi,
RemoveOrphanFiles works only with Hadoop FS/IO, when run locally with the
Hadoop catalog. When I run it on EMR against S3 files using the Glue
catalog, it throws the error below. I have tried Iceberg 0.11 and 0.12
with Spark 3.0.1 and Spark 3.1.1 (all combinations), and I have tried both
the Actions API and the SparkActions API. The result does not change.
Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();
or
SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
The error is:
21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: lakehouse_database.mobiletest1, removeOrphanFilesOlderThan: 1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
    at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
    at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
    at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
    at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
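For reference, the removeOrphanFilesOlderThan value in the log above is an epoch-millisecond timestamp, which is the form passed to olderThan(...). A minimal sketch of decoding it and of building such a cutoff, using only java.time. The class and method names (OrphanCutoff, decode, cutoffBefore) are made up for illustration and are not part of Iceberg, and the three-day age in main is only an example value:

```java
import java.time.Duration;
import java.time.Instant;

final class OrphanCutoff {
    // Decode an epoch-millisecond cutoff into a UTC instant.
    static Instant decode(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    // Build an epoch-millisecond cutoff that lies `age` before `now`.
    static long cutoffBefore(Instant now, Duration age) {
        return now.minus(age).toEpochMilli();
    }

    public static void main(String[] args) {
        // Value copied from the ERROR log line above.
        System.out.println(decode(1630388136606L)); // 2021-08-31T05:35:36.606Z
        // Hypothetical example: a cutoff three days before the current time.
        System.out.println(cutoffBefore(Instant.now(), Duration.ofDays(3)));
    }
}
```

If the log clock is UTC, the decoded cutoff is five minutes earlier than the 05:40:36 timestamp on the log line.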
I also tried the SQL procedure for removing orphan files:
sparkSession.sql("CALL glue_catalog.lakehouse_database.remove_orphan_files(table => 'db.mobiletest1')").show();
and it failed with:
Exception in thread "main" org.apache.iceberg.exceptions.RuntimeIOException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:236)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.buildActualFileDF(BaseDeleteOrphanFilesSparkAction.java:184)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:157)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.TestWriter.main(TestWriter.java:133)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:214)
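The trace above shows the file listing going through Hadoop's Path.getFileSystem, which picks a FileSystem class from the URI scheme of the path. Hadoop checks the configuration key fs.<scheme>.impl first and then falls back to file systems registered on the classpath. A minimal sketch of which key is involved for an s3:// location follows. The class name FsSchemeKey, its methods, and the bucket and path are placeholders for illustration. Which implementation class the key should point to (for example org.apache.hadoop.fs.s3a.S3AFileSystem from hadoop-aws) is an assumption that depends on the environment:

```java
import java.net.URI;

final class FsSchemeKey {
    // Hadoop looks up "fs.<scheme>.impl" in its configuration to find
    // the FileSystem class for a path with that URI scheme.
    static String implKey(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("No scheme in location: " + location);
        }
        return "fs." + scheme + ".impl";
    }

    // The same key as a Spark setting: Spark copies "spark.hadoop.*"
    // entries into the Hadoop configuration.
    static String sparkConfKey(String location) {
        return "spark.hadoop." + implKey(location);
    }

    public static void main(String[] args) {
        // Hypothetical table location; bucket and path are placeholders.
        String location = "s3://example-bucket/warehouse/db/mobiletest1";
        System.out.println(implKey(location));      // fs.s3.impl
        System.out.println(sparkConfKey(location)); // spark.hadoop.fs.s3.impl
    }
}
```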
Could you please help me fix this? Is it a problem with my implementation,
or is it a bug in Iceberg?
Thanks,
Raghu
On Fri, Aug 20, 2021 at 2:54 AM raghavendra186 <[email protected]>
wrote:
> Hi Guys,
>
> I am using Iceberg 0.11.1 with Spark 3.0.1. When I run removeOrphanFiles
> through either the Actions or the SparkActions class, it works with the
> Hadoop catalog when run locally, but I get the exception below when it
> runs on EMR with the Glue catalog. Could you please help me with what I
> am missing here?
>
> Code snippet:
>
> Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();
>
> or
>
> SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
>
> issue (when run on EMR):
>
> 21/08/19 08:12:56 INFO RemoveOrphanFilesMaintenanceJob: Running RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Status:Started, tenant: 1, table:raghu3.cars, removeOrphanFilesOlderThan: {1629360476572}.
>
> 21/08/19 08:12:56 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: raghu3.cars, removeOrphanFilesOlderThan: 1629360476572, Status: Failed, Reason: {}.
>
> java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.raghu3.cars of type ALL_MANIFESTS
>     at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:191)
>     at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:121)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
>     at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:101)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:274)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:117)
>     at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:247)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
>
>
> The table does exist:
>
> [image: image.png]
>
> Has anyone faced this? What is the fix? Is it a bug, or am I missing
> something here?
>
> Thanks,
> Raghu
>