rohit-m-99 commented on issue #6015: URL: https://github.com/apache/hudi/issues/6015#issuecomment-1172007450
Was able to successfully run the job by:

1. Downgrading from Spark 3.2.1 to 3.1.2
2. Using Hadoop version 3.2.0
3. Using the hudi-utilities bundle exclusively in the deltastreamer
4. Exclusively using the insert operation

```
#!/bin/bash
spark-submit \
  --jars opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
  --master spark://spark-master:7077 \
  --total-executor-cores 10 \
  --executor-memory 4g \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utilities-bundle.jar \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table per_tick_stats \
  --table-type COPY_ON_WRITE \
  --min-sync-interval-seconds 30 \
  --source-limit 250000000 \
  --continuous \
  --source-ordering-field $3 \
  --target-base-path $2 \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
  --hoodie-conf hoodie.datasource.write.precombine.field=$3 \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
  --hoodie-conf hoodie.clustering.inline.max.commits=4 \
  --hoodie-conf hoodie.metadata.enable=true \
  --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
  --op INSERT
```

This results in successful ingestion, but it is still pretty slow. See the following operation. Given the 250 MB source limit, it seems like inserting alone shouldn't be taking on the order of 12 minutes?

<img width="1432" alt="image" src="https://user-images.githubusercontent.com/84733594/176841469-548ea774-5485-4b03-bc3a-1252519cf011.png">
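For a rough sense of scale, the two numbers quoted above imply the throughput sketched below. This is only a back-of-the-envelope estimate. It assumes the batch actually filled the whole `--source-limit` and that the roughly 12 minutes covers a single batch, which may also include time spent on inline clustering and metadata table updates. Neither assumption is confirmed by the screenshot, and the variable names are illustrative only.

```shell
#!/bin/bash
# Both inputs come from the comment above and are treated as upper bounds (assumption).
source_limit_bytes=250000000     # value passed to --source-limit
elapsed_seconds=$(( 12 * 60 ))   # "on the order of 12 minutes"

# Integer division is enough for an order-of-magnitude estimate.
bytes_per_sec=$(( source_limit_bytes / elapsed_seconds ))
kb_per_sec=$(( bytes_per_sec / 1000 ))

echo "~${bytes_per_sec} bytes/s (~${kb_per_sec} kB/s)"
```

Under those assumptions the job ingests a few hundred kB per second in total across the 10 executor cores.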
