Hi,

RemoveOrphanFiles works only with the Hadoop FS/IO, i.e. when run locally with
the Hadoop catalog. When I try to run it against S3 files using the Glue catalog
on EMR, it throws the error below. I have tried Iceberg 0.11 and 0.12 with both
Spark 3.0.1 and Spark 3.1.1 (all combinations), and I have tried the calls from
both the Actions API and the SparkActions API; the result does not change.

Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();

or

SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
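
For context, the table handle passed to those calls is loaded through the Glue
catalog, roughly like this (a sketch with placeholder names and warehouse path,
not the exact values from our job):

import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Sketch only: the warehouse location and the table identifier are placeholders.
GlueCatalog glueCatalog = new GlueCatalog();
glueCatalog.initialize("glue_catalog", Map.of(
    CatalogProperties.WAREHOUSE_LOCATION, "s3://my-bucket/warehouse",
    CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO"));
Table table = glueCatalog.loadTable(TableIdentifier.of("lakehouse_database", "mobiletest1"));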

Both calls fail with the error below:

21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in
RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp,
Illegal Arguments in table properties - Can't parse null value from
table properties, tenant: tenantId1, table:
lakehouse_database.mobiletest1, removeOrphanFilesOlderThan:
1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
        at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
        at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
        at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
        at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
        at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
        at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
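
Is there a way to check whether the ALL_MANIFESTS metadata table itself resolves
from the same session? For example, would a direct query like the sketch below be
a valid sanity check (table name as in our setup, syntax as I understand it from
the docs)?

// Sanity-check sketch: read the all_manifests metadata table through SQL.
sparkSession.sql(
    "SELECT * FROM glue_catalog.lakehouse_database.mobiletest1.all_manifests").show();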

I also tried the SQL procedure version of remove orphan files:

sparkSession.sql("CALL
glue_catalog.lakehouse_database.remove_orphan_files(table =>
'db.mobiletest1')").show();

The error from that call is:

Exception in thread "main" org.apache.iceberg.exceptions.RuntimeIOException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:236)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.buildActualFileDF(BaseDeleteOrphanFilesSparkAction.java:184)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:157)
        at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
        at com.salesforce.cdp.lakehouse.spark.tablemaintenance.TestWriter.main(TestWriter.java:133)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:214)
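
My guess is that the directory listing in the orphan-files action goes through
Hadoop's FileSystem API rather than through S3FileIO, so nothing is registered
for the "s3" scheme on my classpath. Would mapping the s3 scheme to S3A in the
Hadoop configuration be a reasonable workaround? Something like this untested
sketch (S3AFileSystem comes from the hadoop-aws module):

// Untested guess: register an implementation for the s3:// scheme so the Hadoop
// FileSystem used by the listing step can resolve warehouse paths.
sparkSession.sparkContext().hadoopConfiguration()
    .set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");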

Could you please help me fix this? Is it something in my implementation, or is
it a bug in Iceberg?

Thanks,
Raghu

On Fri, Aug 20, 2021 at 2:54 AM raghavendra186 <raghu.st...@gmail.com>
wrote:

> Hi Guys,
>
> I am working with Iceberg 0.11.1 and Spark 3.0.1. When I run removeOrphanFiles,
> using either the Actions or the SparkActions class, it works with the Hadoop
> catalog when run locally, but I hit the exception below when run on EMR with the
> Glue catalog. Could you please help me with what I am missing here?
>
> code snippet.
>
> Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();
>
> or
>
> SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();
>
> issue (when run on EMR):
>
> 21/08/19 08:12:56 INFO RemoveOrphanFilesMaintenanceJob: Running 
> RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, 
> Status:Started, tenant: 1, table:raghu3.cars, removeOrphanFilesOlderThan: 
> {1629360476572}.
>
> 21/08/19 08:12:56 ERROR RemoveOrphanFilesMaintenanceJob: Error in 
> RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, 
> Illegal Arguments in table properties - Can't parse null value from table 
> properties, tenant: tenantId1, table: raghu3.cars, 
> removeOrphanFilesOlderThan: 1629360476572, Status: Failed, Reason: {}.
>
> java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.raghu3.cars of type ALL_MANIFESTS
>       at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:191)
>       at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:121)
>       at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
>       at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:101)
>       at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
>       at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
>       at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:274)
>       at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
>       at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
>       at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:117)
>       at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
>       at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:247)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>       at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
>
>
> The table does exist:
>
> [image: image.png]
>
> Did anyone else face this? What is the fix? Is it a bug, or am I missing
> something here?
>
> Thanks,
> Raghu
>
