Hi,

RemoveOrphanFiles works only with Hadoop FS/IO, when run locally with a Hadoop catalog. When I try to run it against S3 files using the Glue catalog on EMR, it throws the error below. I have tried Iceberg 0.11 and 0.12 with Spark 3.0.1 and Spark 3.1.1 (all combinations), and I have tried the call through both the Actions API and the SparkActions API. The result does not change.

Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();

or

SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();

The error is:

21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: lakehouse_database.mobiletest1, removeOrphanFilesOlderThan: 1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
    at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
    at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
    at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
    at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)

I also tried the SQL version of remove orphan files and got a different error:

sparkSession.sql("CALL glue_catalog.lakehouse_database.remove_orphan_files(table => 'db.mobiletest1')").show();

The error is:

Exception in thread "main" org.apache.iceberg.exceptions.RuntimeIOException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:236)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.buildActualFileDF(BaseDeleteOrphanFilesSparkAction.java:184)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:157)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.TestWriter.main(TestWriter.java:133)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:214)

Please help me fix this problem. Is it something to do with my implementation, or is it a bug in Iceberg?

Thanks,
Raghu

On Fri, Aug 20, 2021 at 2:54 AM raghavendra186 <raghu.st...@gmail.com> wrote:

> Hi Guys,
>
> I am working with iceberg 11.1 version iceberg with spark 3.0.1 and when i
> run removeOrphanFiles either using Actions or SparkActions class and its
> functions it works with hadoop catalog when run locally and i face below
> exception when run on EMR with glue catalog. Could you please help me with
> what I am missing here?
>
> code snippet.
>
> Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();
>
> or
>
> SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
>
> issue (when run on EMR):
>
> 21/08/19 08:12:56 INFO RemoveOrphanFilesMaintenanceJob: Running
> RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp,
> Status:Started, tenant: 1, table:raghu3.cars, removeOrphanFilesOlderThan:
> {1629360476572}.
>
> 21/08/19 08:12:56 ERROR RemoveOrphanFilesMaintenanceJob: Error in
> RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp,
> Illegal Arguments in table properties - Can't parse null value from table
> properties, tenant: tenantId1, table: raghu3.cars,
> removeOrphanFilesOlderThan: 1629360476572, Status: Failed, Reason: {}.
>
> java.lang.IllegalArgumentException: Cannot find the metadata table for
> glue_catalog.raghu3.cars of type ALL_MANIFESTS
>     at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:191)
>     at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:121)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
>     at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:101)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
>     at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:274)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:117)
>     at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
>     at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:247)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
>
>
> Table does exists
>
> [image: image.png]
>
> Did any one face this? What is the fix? Is it a bug or am I missing something
> here?
>
> Thanks,
> Raghu
>