You’re likely hitting driver/executor OOM while scanning the Iceberg metadata tables, not the data itself. Rather than nesting the partitions lookup as an IN-subquery inside your WHERE clause, run it separately: split the statement into two queries and filter (or join) the files result by the partition list afterwards.
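A minimal sketch of the split, assuming Spark SQL against Iceberg's `partitions` and `files` metadata tables; the table identifier, the `left_semi` join, and all helper names here are illustrative placeholders, not something from your setup:

```python
# Hedged sketch: run the two metadata queries separately instead of one
# nested IN-subquery, then semi-join on the partition column. Broadcasting
# the (usually small) partition set keeps the large files scan streaming.

def partition_sql(table: str, snapshot_id: int) -> str:
    """Query 1: only the partitions this snapshot touched."""
    return (
        f"SELECT partition FROM {table}.partitions "
        f"VERSION AS OF {snapshot_id} "
        f"WHERE last_updated_snapshot_id = {snapshot_id}"
    )

def files_sql(table: str, snapshot_id: int) -> str:
    """Query 2: file_path plus partition, filtered afterwards via a join."""
    return (
        f"SELECT file_path, partition FROM {table}.files "
        f"VERSION AS OF {snapshot_id}"
    )

def snapshot_file_paths(spark, table: str, snapshot_id: int):
    """Run the two queries as separate DataFrames, then broadcast the
    small partition list into a left-semi join on the files scan."""
    from pyspark.sql.functions import broadcast
    parts = spark.sql(partition_sql(table, snapshot_id))
    files = spark.sql(files_sql(table, snapshot_id))
    return (files
            .join(broadcast(parts), "partition", "left_semi")
            .select("file_path"))
```

If the resulting file list is large, also avoid `collect()` on the driver; write it out (e.g. to S3) instead, so the driver heap never has to hold it.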
Best Regards,
Soumasish Goswami
in: www.linkedin.com/in/soumasish
# (415) 530-0405

On Wed, Sep 24, 2025 at 9:43 AM Nipuna Shantha <tnshantha...@gmail.com> wrote:

> Hi all,
>
> I have an Iceberg data lake implemented on S3 containing multiple
> tables, each with its own partition schema. Daily data ingestion is
> performed on these tables, so the metadata in each table is growing
> rapidly, including the partition and file details.
>
> For another operation I need to get the file_paths for a specific
> snapshot_id. For this, I am using the following query:
>
> """SELECT file_path FROM
> catalog_metastore.<datalake_name>.<table_name>.files VERSION AS OF
> <snapshot_id> WHERE partition IN (SELECT partition FROM
> catalog_metastore.<datalake_name>.<table_name>.partitions VERSION AS
> OF <snapshot_id> WHERE last_updated_snapshot_id = <snapshot_id>)"""
>
> But my secondary Spark operation, implemented in PySpark and executing
> on top of an EMR cluster, fails with the error below when fetching the
> file_paths for a specific snapshot_id:
>
> java.lang.OutOfMemoryError: Java heap space
> -XX:OnOutOfMemoryError="kill -9 %p"
>
> Cluster specs: 1 master node (m5.2xlarge) and 8 core nodes (m5.2xlarge)
>
> Other than increasing the EMR cluster specifications, what options do
> I have to handle this scenario?
>
> Thank You
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org