kazdy commented on issue #3724: URL: https://github.com/apache/hudi/issues/3724#issuecomment-986309728
I'm closing this issue after reading the Hudi code. A Hudi incremental query only reads data that is still available in commits, so you will not necessarily get all the data from the table (which was my concern), because commit files are removed as new data arrives (depending on the cleaner configuration). In other words, you can't always read the stream from the beginning of the table. For newcomers reading the incremental query section of the Spark Guide this is not obvious, and Spark Structured Streaming is not documented at all; both are areas that need improvement.

The incremental query behavior I was confused about is explained well here: https://hudi.apache.org/docs/configurations/#cleanretain_commits

I think Hudi is still missing some functionality when it comes to Spark Structured Streaming:
- readStream from a given point in time,
- readStream up to a given point in time,
- maxBytesPerTrigger,
- maxRecordsPerTrigger.
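For reference, a bounded incremental read between two instants is already possible as a *batch* query via the Spark datasource options. The sketch below uses the option keys from the Hudi configuration docs; the instant-time values and the `read_incremental` helper are hypothetical placeholders, and you should verify the option names against your Hudi version:

```python
# Hedged sketch of a Hudi incremental batch query between two instants.
# Option keys are from the Hudi configuration reference; the instant-time
# values below are made-up examples (format yyyyMMddHHmmss).
incr_opts = {
    "hoodie.datasource.query.type": "incremental",
    # read commits strictly after this instant time
    "hoodie.datasource.read.begin.instanttime": "20211101000000",
    # optionally cap the range at a later instant time
    "hoodie.datasource.read.end.instanttime": "20211201000000",
}

def read_incremental(spark, base_path):
    """Run a bounded incremental batch read of a Hudi table.

    Requires a SparkSession with the Hudi bundle on the classpath;
    `base_path` is the table's base path on storage.
    """
    return spark.read.format("hudi").options(**incr_opts).load(base_path)
```

The point of the feature requests above is that `spark.readStream.format("hudi")` offers no equivalent begin/end instant bounds or per-trigger rate limits, so the streaming path can't express this kind of bounded or throttled read.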