gbcoder2020 opened a new issue, #13522:
URL: https://github.com/apache/hudi/issues/13522

   
   **Describe the problem you faced**
   
   I'm creating a table using INSERT mode with a record level index. The Spark job fails with the following errors:
   
   
![Image](https://github.com/user-attachments/assets/8f2ee022-3cf0-4f68-80bc-103b2fcd850d)
   
![Image](https://github.com/user-attachments/assets/c6f7b2a2-fbe1-47fc-a7f0-29980002a9fb)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ```
   data.write
     .format("hudi")
     .options(..)
     .mode("Overwrite")
     .save(<path>)
   ```
   
   
   Hudi options for insert:
   
   ```
   hoodie.metadata.record.index.max.filegroup.count -> 100000,
   hoodie.embed.timeline.server -> false,
   hoodie.parquet.small.file.limit -> 1073741824,
   hoodie.insert.shuffle.parallelism -> 15800,
   hoodie.metadata.record.index.enable -> true,
   path -> <hudi table path>,
   hoodie.datasource.write.precombine.field -> lut,
   hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload,
   hoodie.metadata.index.column.stats.enable -> true,
   hoodie.parquet.max.file.size -> 2147483648,
   hoodie.metadata.enable -> true,
   hoodie.index.type -> RECORD_INDEX,
   hoodie.datasource.write.operation -> insert,
   hoodie.parquet.compression.codec -> snappy,
   hoodie.datasource.write.recordkey.field -> <id>,
   hoodie.table.name -> <table_name>,
   hoodie.datasource.write.table.type -> COPY_ON_WRITE,
   hoodie.datasource.write.hive_style_partitioning -> true,
   hoodie.write.markers.type -> DIRECT,
   hoodie.populate.meta.fields -> true,
   hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator,
   hoodie.write.lock.provider -> org.apache.hudi.client.transaction.lock.InProcessLockProvider,
   hoodie.datasource.write.partitionpath.field -> entityType,
   hoodie.metadata.record.index.min.filegroup.count -> 5000,
   hoodie.write.concurrency.mode -> SINGLE_WRITER
   ```
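   For completeness, a minimal sketch of the reproduce step with the options above assembled into a single write call. This is only an illustration of how the report's configuration fits together; `<path>`, `<id>`, and `<table_name>` are placeholders exactly as given in the report, and `data` is assumed to be an existing `DataFrame`:

   ```scala
   import org.apache.spark.sql.SaveMode

   // Sketch only: the subset of reported options most relevant to the
   // record level index; placeholders are kept as in the report.
   val hudiOptions = Map(
     "hoodie.table.name" -> "<table_name>",
     "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
     "hoodie.datasource.write.operation" -> "insert",
     "hoodie.datasource.write.recordkey.field" -> "<id>",
     "hoodie.datasource.write.partitionpath.field" -> "entityType",
     "hoodie.datasource.write.precombine.field" -> "lut",
     "hoodie.metadata.enable" -> "true",
     "hoodie.metadata.record.index.enable" -> "true",
     "hoodie.index.type" -> "RECORD_INDEX",
     "hoodie.metadata.record.index.min.filegroup.count" -> "5000",
     "hoodie.metadata.record.index.max.filegroup.count" -> "100000"
   )

   data.write
     .format("hudi")
     .options(hudiOptions)
     .mode(SaveMode.Overwrite)
     .save("<path>")
   ```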
   
   **Expected behavior**
   
   Successful insert into the Hudi table.
   
   **Environment Description**
   
   * Hudi version : 0.15.0
   
   * Spark version : 3.4
   
   * Hive version : NA
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Please help me understand why this problem may be occurring.
   
   **Stacktrace**
   
   ```
   Building workload profile: <table>
   countByKey at HoodieJavaPairRDD.java:105
   RDD: MapPartitionsRDD
   org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
   org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:105)
   org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:197)
   org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:168)
   org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
   org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:58)
   org.apache.hudi.table.action.commit.SparkInsertCommitActionExecutor.execute(SparkInsertCommitActionExecutor.java:44)
   org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insert(HoodieSparkCopyOnWriteTable.java:114)
   org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insert(HoodieSparkCopyOnWriteTable.java:98)
   org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:182)
   org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:219)
   org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:492)
   org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:490)
   org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
   org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
   org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
   org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   ```
   
   ```
   Job aborted due to stage failure: Task 427 in stage 39.0 failed 4 times, most recent failure: Lost task 427.3 in stage 39.0 (TID 162985) (<ip> executor 853): ExecutorLostFailure (executor 853 exited caused by one of the running tasks) Reason: Container from a bad node: container_1749069924658_0001_01_001944 on host: <ip>. Exit status: 143. Diagnostics: [2025-06-05 00:16:15.459] Container killed on request. Exit code is 143
   ```
   
   ```
   3134.968: [GC concurrent-string-deduplication, 944.0B->808.0B(136.0B), avg 63.1%, 0.0000343 secs]
   #
   # java.lang.OutOfMemoryError: Java heap space
   # -XX:OnOutOfMemoryError="kill %p"
   #   Executing /bin/sh -c "kill 15018"...
   Heap
    garbage-first heap   total 55574528K, used 47555500K [0x00007f2c78000000, 0x00007f2c7880d400, 0x00007f39b8000000)
     region size 8192K, 59 young (483328K), 0 survivors (0K)
    Metaspace       used 109421K, capacity 115699K, committed 135680K, reserved 137216K
   ```

