worf0815 opened a new issue #5054: URL: https://github.com/apache/hudi/issues/5054
**Describe the problem you faced**

We are trying to ingest and deduplicate via Hudi a table with a total of 25 billion records, where each record is about 3-4 KB in size (there are even larger tables in our portfolio, the largest ingesting 1-7 billion records daily with a total volume of 221 billion). The above table ran into memory issues with AWS Glue 3 and failed in the "countByKey - Building Workload Profile" stage with "org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 26), which maintains the block data to fetch is dead." in the Spark UI logs.

**To Reproduce**

Steps to reproduce the behavior:

1. Start a Glue 3.0 job with 80 G1.X workers, reading from a standard Glue catalog table whose files are stored on S3.
2. Without specifying a bounded context of roughly 7 GB in Glue, the job fails with an out-of-memory error.
3. I also tried it with Glue 2.0 and spill_to_s3 enabled, which resulted in nearly 3 TB of spilling.

**Expected behavior**

If possible, a larger number of records should be processable during upsert with 80 G1.X workers.

**Environment Description**

* Hudi version : 0.9.0 (via AWS Glue Connector)
* Spark version : 3.1.1 (AWS Glue)
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

The original complete dataset is about 420 GB of snappy-compressed parquet files. Running the Hudi configuration with or without the memory-fraction settings did not make a difference. Partition columns and record keys each consist of multiple columns:

```
commonConfig = {
    "className": "org.apache.hudi",
    "hoodie.table.name": hudi_table_name,
    "path": f"s3://upsert-poc/hudie/default/{hudi_table_name}",
    "hoodie.datasource.write.precombine.field": "update_date",
    "hoodie.datasource.write.partitionpath.field": partition_fields,
    "hoodie.datasource.write.recordkey.field": primary_keys,
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.database": hudi_database,
    "hoodie.datasource.hive_sync.table": hudi_table_name,
    "hoodie.datasource.hive_sync.partition_fields": partition_fields,
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

incrementalConfig = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": 1,
}
```

**Stacktrace**

SparkUI LogView (screenshot)
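For reference, a minimal sketch of how the two configuration dicts above are typically combined and handed to the Hudi Glue (marketplace Spark) connector for the upsert; the source database/table names and variable names are illustrative assumptions, not the exact job code:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Sketch only: read the source table from the Glue catalog and upsert it
# through the Hudi marketplace connector using the configs defined above.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Placeholder database/table names for the source data.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="source_table",
)

# Merge the shared and incremental (upsert) settings into one options dict.
combinedConf = {**commonConfig, **incrementalConfig}

glueContext.write_dynamic_frame.from_options(
    frame=source_dyf,
    connection_type="marketplace.spark",
    connection_options=combinedConf,
)
```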
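The memory-fraction experiments mentioned under "Additional context" had roughly the following shape; the option names are taken from the Hudi 0.9.0 configuration reference, and the values shown are placeholders rather than the exact numbers used:

```python
# Placeholder values -- option names per the Hudi 0.9.0 configuration reference.
tuningConfig = {
    # Fraction of available memory the merge step's spillable map may use.
    "hoodie.memory.merge.fraction": "0.75",
    # Local directory where the spillable map overflows to disk.
    "hoodie.memory.spillable.map.path": "/tmp/hudi_spill",
    # Shuffle parallelism for the upsert/insert write paths.
    "hoodie.upsert.shuffle.parallelism": "1500",
    "hoodie.insert.shuffle.parallelism": "1500",
}

# Merged with the base configs before writing.
combinedConf = {**commonConfig, **incrementalConfig, **tuningConfig}
```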