rohit-m-99 commented on issue #6015:
URL: https://github.com/apache/hudi/issues/6015#issuecomment-1172007450

   I was able to run the job successfully by:
   
   1. Downgrading from Spark 3.2.1 to 3.1.2
   2. Using Hadoop version 3.2.0
   3. Using hudi-utilities bundle exclusively in the deltastreamer
   4. Exclusively using the insert operation
   
   ```
   #!/bin/bash
   spark-submit \
   --jars opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
   --master spark://spark-master:7077 \
   --total-executor-cores 10 \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --min-sync-interval-seconds 30 \
   --source-limit 250000000 \
   --continuous \
   --source-ordering-field $3 \
   --target-base-path $2 \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
   --hoodie-conf hoodie.datasource.write.precombine.field=$3 \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.metadata.enable=true \
   --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
   --op INSERT
   ```
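
For readers following along: the script above takes six positional arguments. A minimal sketch of naming and validating them before calling `spark-submit` might look like the following. The function name `parse_args` and the variable names are hypothetical and not part of the original script; the mapping itself is read off the flags each `$N` is passed to.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): gives the six positional
# arguments descriptive names and fails early if any are missing.
parse_args() {
  if [ "$#" -ne 6 ]; then
    echo "usage: <source-root> <target-base-path> <ordering-field> <record-key> <sort-columns> <partition-path>" >&2
    return 1
  fi
  SOURCE_ROOT=$1       # hoodie.deltastreamer.source.dfs.root
  TARGET_BASE_PATH=$2  # --target-base-path
  ORDERING_FIELD=$3    # --source-ordering-field and precombine field
  RECORD_KEY=$4        # hoodie.datasource.write.recordkey.field
  SORT_COLUMNS=$5      # hoodie.clustering.plan.strategy.sort.columns
  PARTITION_PATH=$6    # hoodie.datasource.write.partitionpath.field
}
```

With these in place, the `spark-submit` call could reference `"$SOURCE_ROOT"`, `"$TARGET_BASE_PATH"`, and so on (quoted) instead of bare `$1` through `$6`.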
   
   This results in successful ingestion, but it is still pretty slow. See the following operation:
   
   Given the 250 MB source limit, it seems like an insert-only operation shouldn't be taking on the order of 12 minutes?
   
   <img width="1432" alt="image" src="https://user-images.githubusercontent.com/84733594/176841469-548ea774-5485-4b03-bc3a-1252519cf011.png">
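
To put the numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the batch actually read the full `--source-limit` of 250000000 bytes and took exactly 12 minutes; both are approximations, so the result is only an upper bound on the effective throughput.

```shell
#!/bin/bash
# Upper-bound throughput implied by the figures quoted in the comment.
source_limit_bytes=250000000    # value passed to --source-limit
elapsed_seconds=$((12 * 60))    # "on the order of 12 minutes"
bytes_per_second=$((source_limit_bytes / elapsed_seconds))
kb_per_second=$((bytes_per_second / 1000))
echo "at most ${bytes_per_second} bytes/s (~${kb_per_second} kB/s)"
```

That works out to roughly 0.35 MB/s across the whole job, which is why the run time looks suspicious for an insert-only workload.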
   

