veenaypatil opened a new issue, #6014:
URL: https://github.com/apache/hudi/issues/6014

   **Describe the problem you faced**
   
   We are observing higher run times for one batch: it took 15+ hours to complete a single batch, while the subsequent batches are running fine. The dataset in question is not big, and GC times are low. Attaching a few screenshots for reference.
   hoodieConfigs for reference
   
   <img width="1780" alt="Screenshot 2022-06-29 at 10 04 10 PM" src="https://user-images.githubusercontent.com/52563354/176704158-598c2a7a-f090-4481-8d5c-40df8bff9235.png">
   <img width="1780" alt="Screenshot 2022-06-29 at 10 06 53 PM" src="https://user-images.githubusercontent.com/52563354/176704200-29ef41de-d0f0-49e9-82bd-aebfae4c0b5f.png">
   <img width="1780" alt="Screenshot 2022-06-29 at 10 08 11 PM" src="https://user-images.githubusercontent.com/52563354/176704211-3662aa4c-07d1-48d3-a757-ff2921729258.png">
   
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.0.3
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.2
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
   
   **Additional context**
   
   Hudi Configs 
   
   ```
   hoodieConfigs:
     hoodie.datasource.write.operation: upsert
     hoodie.datasource.write.table.type: MERGE_ON_READ
     hoodie.datasource.write.partitionpath.field: ""
     hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
     hoodie.metrics.on: true
     hoodie.metrics.reporter.type: CLOUDWATCH
     hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
     hoodie.parquet.max.file.size: 6110612736
     hoodie.compact.inline: true
     hoodie.clean.automatic: true
     hoodie.compact.inline.trigger.strategy: NUM_AND_TIME
     hoodie.clean.async: true
     hoodie.cleaner.policy: KEEP_LATEST_COMMITS
     hoodie.cleaner.commits.retained: 120
     hoodie.keep.min.commits: 130
     hoodie.keep.max.commits: 131
   ```
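   For context, the write path that consumes these options would look roughly like the sketch below (the table name and target path are placeholders, and it assumes a Spark session with the Hudi bundle on the classpath). One thing worth noting about the retention values above: Hudi requires `hoodie.keep.min.commits` to be greater than `hoodie.cleaner.commits.retained`, which 130 > 120 satisfies.
   
   ```python
   # Sketch only: applying the Hudi options above to a Spark DataFrame write.
   # "my_table" and the S3 path are placeholders, not from the original report.
   hudi_options = {
       "hoodie.table.name": "my_table",  # placeholder
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.partitionpath.field": "",
       "hoodie.datasource.write.keygenerator.class":
           "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
       "hoodie.compact.inline": "true",
       "hoodie.compact.inline.trigger.strategy": "NUM_AND_TIME",
       "hoodie.clean.automatic": "true",
       "hoodie.clean.async": "true",
       "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
       "hoodie.cleaner.commits.retained": "120",
       "hoodie.keep.min.commits": "130",
       "hoodie.keep.max.commits": "131",
   }
   
   # Archival must keep more commits than the cleaner retains.
   assert int(hudi_options["hoodie.keep.min.commits"]) > \
          int(hudi_options["hoodie.cleaner.commits.retained"])
   
   # With a live SparkSession this would be:
   # df.write.format("hudi").options(**hudi_options) \
   #     .mode("append").save("s3a://bucket/path")  # placeholder path
   ```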
   
   Spark Job configs
   
   ```
   {
     "className": "com.hotstar.driver.CdcCombinedDriver",
     "proxyUser": "root",
     "driverCores": 1,
     "executorCores": 4,
     "executorMemory": "4G",
     "driverMemory": "4G",
     "queue": "cdc",
     "name": "hudiJob",
     "file": "s3a://bucket/jars/prod.jar",
     "conf": {
       "spark.eventLog.enabled": "false",
       "spark.ui.enabled": "true",
       "spark.streaming.concurrentJobs": "1",
       "spark.streaming.backpressure.enabled": "false",
       "spark.streaming.kafka.maxRatePerPartition": "500",
       "spark.yarn.am.nodeLabelExpression": "cdc",
       "spark.shuffle.service.enabled": "true",
       "spark.driver.maxResultSize": "8g",
       "spark.driver.memoryOverhead": "2048",
       "spark.executor.memoryOverhead": "2048",
       "spark.dynamicAllocation.enabled": "true",
       "spark.dynamicAllocation.minExecutors": "25",
       "spark.dynamicAllocation.maxExecutors": "50",
       "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
       "spark.jars.packages": "org.apache.spark:spark-avro_2.12:3.0.2,com.izettle:metrics-influxdb:1.2.3",
       "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
       "spark.rdd.compress": "true",
       "spark.sql.hive.convertMetastoreParquet": "false",
       "spark.yarn.maxAppAttempts": "1",
       "spark.task.cpus": "1"
     }
   }
   ```
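   For reference, the per-executor YARN container size implied by these settings is executor memory plus `spark.executor.memoryOverhead`; a quick arithmetic check using the values from the config above:
   
   ```python
   # Rough YARN container sizing from the Spark config above.
   executor_memory_mb = 4 * 1024   # "executorMemory": "4G"
   overhead_mb = 2048              # "spark.executor.memoryOverhead": "2048"
   container_mb = executor_memory_mb + overhead_mb   # 6144 MB per executor
   
   max_executors = 50              # "spark.dynamicAllocation.maxExecutors"
   peak_cluster_mb = container_mb * max_executors    # 307200 MB, i.e. ~300 GB
   ```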
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
