Hi all,

I have an Iceberg data lake on S3 that contains multiple tables, each
with its own partition spec. Data is ingested into these tables daily,
so the metadata of each table (partition and file details) is growing
rapidly.

For another operation I need to get the file_paths for a specific
snapshot_id. For this, I am using the following query:

    SELECT file_path
    FROM catalog_metastore.<datalake_name>.<table_name>.files VERSION AS OF <snapshot_id>
    WHERE partition IN (
        SELECT partition
        FROM catalog_metastore.<datalake_name>.<table_name>.partitions VERSION AS OF <snapshot_id>
        WHERE last_updated_snapshot_id = <snapshot_id>
    )
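In case it is relevant: I also tried expressing the IN subquery as a
join, so Spark can plan the two metadata-table scans together. A small
helper I use to build that variant (the catalog/table names below are
placeholders, and the alias placement after VERSION AS OF is how I
understand Spark's time-travel syntax):

```python
def snapshot_file_paths_query(catalog: str, db: str, table: str,
                              snapshot_id: int) -> str:
    """Build the file_path lookup as a JOIN instead of an IN subquery.

    Intended to be semantically equivalent to the IN-subquery form:
    keep only files whose partition was last updated by the snapshot.
    """
    fq = f"{catalog}.{db}.{table}"
    return (
        f"SELECT f.file_path "
        f"FROM {fq}.files VERSION AS OF {snapshot_id} f "
        f"JOIN {fq}.partitions VERSION AS OF {snapshot_id} p "
        f"ON f.partition = p.partition "
        f"WHERE p.last_updated_snapshot_id = {snapshot_id}"
    )

# Placeholder names, for illustration only:
sql = snapshot_file_paths_query("catalog_metastore", "my_db", "my_table",
                                123456789)
# df = spark.sql(sql)  # then write the result out rather than collect()
```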

However, this secondary Spark operation (implemented in PySpark and
running on an EMR cluster), which fetches the file_paths for a specific
snapshot_id, has now started failing with the error below:

    java.lang.OutOfMemoryError: Java heap space
    -XX:OnOutOfMemoryError="kill -9 %p"

Cluster specs: 1 master node (m5.2xlarge) and 8 core nodes (m5.2xlarge).

Other than scaling up the EMR cluster, what options do I have to handle
this scenario?
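For context, I am aware of the standard Spark driver-memory knobs and
can try them at submit time (the values below are illustrative, not
tuned, and the script name is a placeholder); I am hoping for options
beyond this kind of tuning:

```shell
# Illustrative spark-submit flags; exact values depend on the workload.
spark-submit \
  --conf spark.driver.memory=8g \
  --conf spark.driver.maxResultSize=4g \
  --conf spark.sql.shuffle.partitions=400 \
  my_snapshot_job.py
```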

Thank You
