Hi all,

I have an Iceberg data lake on S3 that contains multiple tables, each with its own partition spec. Daily data ingestion runs against these tables, so the metadata for each table, including partition and file details, is growing rapidly.
For another operation I need to retrieve the file_paths for a specific snapshot_id. I am using the following query:

"""
SELECT file_path
FROM catalog_metastore.<datalake_name>.<table_name>.files VERSION AS OF <snapshot_id>
WHERE partition IN (
    SELECT partition
    FROM catalog_metastore.<datalake_name>.<table_name>.partitions VERSION AS OF <snapshot_id>
    WHERE last_updated_snapshot_id = <snapshot_id>
)
"""

However, this secondary Spark operation, implemented in PySpark and running on an EMR cluster, now fails with the following error:

java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"

Cluster specs: 1 master node (m5.2xlarge) and 8 core nodes (m5.2xlarge).

Other than increasing the EMR cluster specifications, what options do I have to handle this scenario?

Thank You
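P.S. For reference, here is a minimal sketch of how the secondary job invokes the query. It is simplified: the catalog/table names and snapshot id are placeholders, and the final collect() is an approximation of how the paths are pulled back to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot-file-paths").getOrCreate()

snapshot_id = 1234567890  # placeholder value

query = f"""
    SELECT file_path
    FROM catalog_metastore.my_datalake.my_table.files VERSION AS OF {snapshot_id}
    WHERE partition IN (
        SELECT partition
        FROM catalog_metastore.my_datalake.my_table.partitions VERSION AS OF {snapshot_id}
        WHERE last_updated_snapshot_id = {snapshot_id}
    )
"""

# collect() materializes every matching file_path row on the driver,
# so the result size is bounded only by the table's metadata volume.
file_paths = [row.file_path for row in spark.sql(query).collect()]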
