You’re likely hitting Spark driver/executor OOM while querying Iceberg
metadata, not the data itself.
Instead of running the subquery inside your WHERE clause, split it into
two separate queries: fetch the matching partitions first, then filter
the files table with that result.
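
A minimal PySpark sketch of that two-step approach (the catalog, database,
and table names are placeholders matching your query, and the simple
IN-list below assumes each partition value can be rendered as a SQL
literal; Iceberg's `partition` metadata column is a struct, so adjust the
rendering to your partition spec):

```python
def partitions_query(catalog, db, table, snapshot_id):
    # Step 1: find the partitions touched by the snapshot.
    return (
        f"SELECT partition FROM {catalog}.{db}.{table}.partitions "
        f"VERSION AS OF {snapshot_id} "
        f"WHERE last_updated_snapshot_id = {snapshot_id}"
    )

def files_query(catalog, db, table, snapshot_id, partitions):
    # Step 2: filter the files table with the (small) collected list,
    # so Spark never plans both metadata scans in one query.
    in_list = ", ".join(f"'{p}'" for p in partitions)
    return (
        f"SELECT file_path FROM {catalog}.{db}.{table}.files "
        f"VERSION AS OF {snapshot_id} "
        f"WHERE partition IN ({in_list})"
    )

# Usage with an active SparkSession (collect the partition list to the
# driver first, then inline it into the second query):
# parts = [r.partition for r in spark.sql(partitions_query(...)).collect()]
# paths = [r.file_path for r in spark.sql(files_query(..., parts)).collect()]
```

Collecting the partition list first keeps the second scan selective and
avoids materializing both metadata tables in a single plan.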

Best Regards
Soumasish Goswami
in: www.linkedin.com/in/soumasish
# (415) 530-0405

On Wed, Sep 24, 2025 at 9:43 AM Nipuna Shantha <tnshantha...@gmail.com>
wrote:

> Hi all,
>
> I have an Iceberg data lake on S3 that contains multiple tables, each
> with its own partition spec. Data is ingested into these tables daily,
> so the metadata in each table, including the partition and file
> details, is growing rapidly.
>
> For another operation I need to get the file_paths for a specific
> snapshot_id. For this, I am using the following query.
> """SELECT file_path
> FROM catalog_metastore.<datalake_name>.<table_name>.files
> VERSION AS OF <snapshot_id>
> WHERE partition IN (
>   SELECT partition
>   FROM catalog_metastore.<datalake_name>.<table_name>.partitions
>   VERSION AS OF <snapshot_id>
>   WHERE last_updated_snapshot_id = <snapshot_id>)"""
>
> But my downstream Spark job, implemented in PySpark and running on an
> EMR cluster, which fetches the file_paths for a specific snapshot_id,
> now fails with the error below.
>
> java.lang.OutOfMemoryError: Java heap space
> -XX.OnOutOfMemoryError="kill -9 %p"
>
> Cluster Specs: 1 master node m5.2xlarge and 8 core nodes m5.2xlarge
>
> Other than increasing the EMR cluster specifications, what options do
> I have for handling this scenario?
>
> Thank You
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
