worf0815 opened a new issue #5054:
URL: https://github.com/apache/hudi/issues/5054


   **Describe the problem you faced**
   
   We are trying to ingest and deduplicate a table via Hudi with a total of 25 billion records, where each record is about 3-4 KB in size (there are even larger tables in our portfolio, the largest ingesting 1-7 billion records daily with a total volume of 221 billion records).
   
   The above table ran into memory issues with AWS Glue 3 and failed in the "countByKey - Building Workload Profile" stage with "org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 26), which maintains the block data to fetch is dead." in the Spark UI logs.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a Glue 3.0 job with 80 G1.X workers, reading from a standard Glue catalog table whose files are stored on S3.
   2. Without specifying a bounded context of roughly 7 GB in Glue, the job fails with an out-of-memory error.
   3. I also tried it with Glue 2.0 and spill_to_s3 enabled, which resulted in nearly 3 TB of spilling (see the tuning sketch after this list).
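
   For reference, a minimal sketch (placeholder values; `tuningConfig` is just an illustrative name, not a setting taken from this job) of the kind of Hudi memory/parallelism options that are commonly adjusted for large upserts like this:

   ```
   # Illustrative only: placeholder values for options commonly tuned on large upserts.
   tuningConfig = {
       # Spread the upsert shuffle across more tasks so each task merges fewer records.
       "hoodie.upsert.shuffle.parallelism": 1500,
       # Fraction of executor memory the merge step may use before spilling to disk.
       "hoodie.memory.merge.fraction": 0.6,
   }
   ```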
   
   **Expected behavior**
   
   If possible, a larger number of records should be processable during an upsert with 80 G1.X workers.
   
   **Environment Description**
   
   * Hudi version : 0.9.0 (via AWS Glue Connector)
   
   * Spark version : 3.1.1 (AWS Glue)
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   The original complete dataset is about 420 GB of snappy-compressed Parquet files.
   Hudi configuration with or without the memory fraction setting did not make a difference. The partition columns and record keys consist of multiple columns:
   
   ```
   commonConfig = {
       "className": "org.apache.hudi",
       "hoodie.table.name": hudi_table_name,
       "path": f"s3://upsert-poc/hudie/default/{hudi_table_name}",
       "hoodie.datasource.write.precombine.field": "update_date",
       "hoodie.datasource.write.partitionpath.field": partition_fields,
       "hoodie.datasource.write.recordkey.field": primary_keys,
       "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.support_timestamp": "true",
       "hoodie.datasource.hive_sync.use_jdbc": "false",
       "hoodie.datasource.hive_sync.database": hudi_database,
       "hoodie.datasource.hive_sync.table": hudi_table_name,
       "hoodie.datasource.hive_sync.partition_fields": partition_fields,
       "hoodie.datasource.hive_sync.mode":"hms",
       "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   }
   
   incrementalConfig = {
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
       "hoodie.cleaner.commits.retained": 1,
   }
   ```
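
   For reference, a sketch of how the dicts above are typically combined and passed to the Glue marketplace Spark connector (assuming `glueContext` and an input DynamicFrame `dyf` already exist in the job; this is not the exact write call from the script):

   ```
   # Sketch: merge the shared and incremental options and write through the
   # AWS Glue marketplace Spark connector for Hudi.
   combinedConf = {**commonConfig, **incrementalConfig}

   glueContext.write_dynamic_frame.from_options(
       frame=dyf,
       connection_type="marketplace.spark",
       connection_options=combinedConf,
   )
   ```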
   
   **Stacktrace**
   Screenshot from the Spark UI log view:
   
   
![grafik](https://user-images.githubusercontent.com/10959555/158626781-2e67516f-84b9-409f-838a-70bde86861e0.png)
   

