reste85 opened a new issue #1598:
URL: https://github.com/apache/incubator-hudi/issues/1598


   Hi all,
   We're experiencing strange issues using DeltaStreamer with Hudi version 
0.5.2.
   We're reading from a Kafka source, specifically a compacted topic with 
50 partitions. We partition the data via a custom KeyResolver that works 
much like Kafka's partitioner: murmur3hash(recordKey) mod number_of_partitions.
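
   To make the partitioning scheme concrete, here is a minimal Python sketch 
of it. This is not our actual KeyResolver code: the function names, the seed 
of 0, the UTF-8 key encoding and the unsigned interpretation of the hash are 
all assumptions made for illustration.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit), returned as an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1-3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def partition_for(record_key: str, num_partitions: int = 50) -> int:
    """Hypothetical helper: map a record key to a partition index."""
    # Assumption: keys are hashed as UTF-8 bytes with seed 0.
    return murmur3_32(record_key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    for key in ("order-1", "order-2", "order-3"):
        print(key, "->", partition_for(key))
```

   Because the hash is treated as unsigned, the modulo always falls in 
`[0, num_partitions)`, so every record key lands in one of 50 Hudi partitions.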
   During the first three runs everything goes smoothly (each run ingests 
5 million records). On the fourth run, the process suddenly slows down 
significantly.
   Looking at the job stages, the countByKey step is the one taking too 
long, with low cluster usage/load (is it shuffling?).
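
   For context on the stage name: Spark's `countByKey` counts the elements 
of a pair RDD per key and returns the counts to the driver. The snippet 
below is only a plain-Python stand-in for that behaviour, with made-up keys 
and values; it is not Hudi's or Spark's implementation.

```python
from collections import Counter


def count_by_key(pairs):
    """Local stand-in for countByKey: number of (key, value) pairs per key."""
    return Counter(key for key, _ in pairs)


# Hypothetical (partition, recordKey) pairs, for illustration only.
records = [("p0", "a"), ("p1", "b"), ("p0", "c")]
counts = count_by_key(records)

if __name__ == "__main__":
    for key, n in sorted(counts.items()):
        print(key, n)
```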
   
   Here are the Hudi properties we're using:
   `# Hoodie properties
   hoodie.upsert.shuffle.parallelism=5
   hoodie.insert.shuffle.parallelism=5
   hoodie.bulkinsert.shuffle.parallelism=5
   hoodie.embed.timeline.server=true
   hoodie.filesystem.view.type=EMBEDDED_KV_STORE
   hoodie.compact.inline=false
   hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
   hoodie.clean.automatic=true
   hoodie.combine.before.upsert=true
   hoodie.cleaner.fileversions.retained=1
   hoodie.bloom.index.prune.by.ranges=false
   hoodie.index.bloom.num_entries=1000000`
   
   Last run (the one that is taking too long):
   <img width="1662" alt="Screenshot 2020-05-07 at 15 32 24" 
src="https://user-images.githubusercontent.com/14905251/81303030-749c5100-907b-11ea-84c6-59bb10d2f48d.png">
   
   <img width="1664" alt="Screenshot 2020-05-07 at 15 32 32" 
src="https://user-images.githubusercontent.com/14905251/81303064-81b94000-907b-11ea-8ac3-b0b75b2be443.png">
   
   
   
   First, second and third runs (which went very well):
   <img width="1674" alt="firsrun" 
src="https://user-images.githubusercontent.com/14905251/81303100-8ed62f00-907b-11ea-8a82-79a171f32b31.png">
   <img width="1657" alt="secondrun" 
src="https://user-images.githubusercontent.com/14905251/81303109-91388900-907b-11ea-9ea7-dbf9ca4cf33a.png">
   <img width="1667" alt="thirdrun" 
src="https://user-images.githubusercontent.com/14905251/81303117-939ae300-907b-11ea-9c37-857d34cf492d.png">
   
   
   Thank you in advance!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

