veenaypatil opened a new issue, #6014: URL: https://github.com/apache/hudi/issues/6014
**Describe the problem you faced**

We are observing a much higher run time for one batch: it took over 15 hours to complete a single batch, while the subsequent batches are running fine. The dataset in question is not big, and GC times are low. Attaching a few screenshots for reference; the hoodieConfigs are listed below.

<img width="1780" alt="Screenshot 2022-06-29 at 10 04 10 PM" src="https://user-images.githubusercontent.com/52563354/176704158-598c2a7a-f090-4481-8d5c-40df8bff9235.png">
<img width="1780" alt="Screenshot 2022-06-29 at 10 06 53 PM" src="https://user-images.githubusercontent.com/52563354/176704200-29ef41de-d0f0-49e9-82bd-aebfae4c0b5f.png">
<img width="1780" alt="Screenshot 2022-06-29 at 10 08 11 PM" src="https://user-images.githubusercontent.com/52563354/176704211-3662aa4c-07d1-48d3-a757-ff2921729258.png">

**Environment Description**

* Hudi version : 0.10.1
* Spark version : 3.0.3
* Hive version : 3.1.2
* Hadoop version : 3.2.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : NO

**Additional context**

Hudi configs:

```
hoodieConfigs:
  hoodie.datasource.write.operation: upsert
  hoodie.datasource.write.table.type: MERGE_ON_READ
  hoodie.datasource.write.partitionpath.field: ""
  hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
  hoodie.metrics.on: true
  hoodie.metrics.reporter.type: CLOUDWATCH
  hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
  hoodie.parquet.max.file.size: 6110612736
  hoodie.compact.inline: true
  hoodie.clean.automatic: true
  hoodie.compact.inline.trigger.strategy: NUM_AND_TIME
  hoodie.clean.async: true
  hoodie.cleaner.policy: KEEP_LATEST_COMMITS
  hoodie.cleaner.commits.retained: 120
  hoodie.keep.min.commits: 130
  hoodie.keep.max.commits: 131
```

Spark job configs:

```
{
  "className": "com.hotstar.driver.CdcCombinedDriver",
  "proxyUser": "root",
  "driverCores": 1,
  "executorCores": 4,
  "executorMemory": "4G",
  "driverMemory": "4G",
  "queue": "cdc",
  "name": "hudiJob",
  "file": "s3a://bucket/jars/prod.jar",
  "conf": {
    "spark.eventLog.enabled": "false",
    "spark.ui.enabled": "true",
    "spark.streaming.concurrentJobs": "1",
    "spark.streaming.backpressure.enabled": "false",
    "spark.streaming.kafka.maxRatePerPartition": "500",
    "spark.yarn.am.nodeLabelExpression": "cdc",
    "spark.shuffle.service.enabled": "true",
    "spark.driver.maxResultSize": "8g",
    "spark.driver.memoryOverhead": "2048",
    "spark.executor.memoryOverhead": "2048",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "25",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:3.0.2,com.izettle:metrics-influxdb:1.2.3",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.rdd.compress": "true",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.yarn.maxAppAttempts": "1",
    "spark.task.cpus": "1"
  }
}
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
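Editor's note: for readers trying to reproduce the reporter's setup, the hoodieConfigs quoted in the issue would typically be passed as options to the Hudi Spark datasource on write. The sketch below is a minimal, hedged illustration, not the reporter's actual driver code; the table name, base path, and `df` are hypothetical placeholders (the issue does not state them).

```python
# Sketch only: wiring the issue's hoodieConfigs into a Hudi Spark write.
# "cdc_table" and the save path are placeholders, not from the issue.
hudi_options = {
    "hoodie.table.name": "cdc_table",  # placeholder table name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.trigger.strategy": "NUM_AND_TIME",
    "hoodie.clean.automatic": "true",
    "hoodie.clean.async": "true",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "120",
    "hoodie.keep.min.commits": "130",
    "hoodie.keep.max.commits": "131",
    "hoodie.parquet.max.file.size": "6110612736",
}

# With a SparkSession and an upstream DataFrame `df` in scope,
# the write would look like:
#
# df.write.format("hudi") \
#     .options(**hudi_options) \
#     .mode("append") \
#     .save("s3a://bucket/path/cdc_table")  # placeholder path
```

Note that `hoodie.parquet.max.file.size` is set to roughly 6 GB here, far above the Hudi default of 120 MB; this is simply what the issue reports, reproduced verbatim.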
