kazdy commented on issue #3724: URL: https://github.com/apache/hudi/issues/3724#issuecomment-986309728
I'm closing this issue after reading the Hudi code. A Hudi incremental query only reads data that is still available in commits, so you will not necessarily get all the data from the table (which was my concern), because commit files are removed as new data arrives (depending on the cleaner configuration). In other words, you can't always read the stream from the beginning of the table. For newcomers reading the incremental query section of the Spark Guide this is not obvious, and Spark Structured Streaming is not documented at all; both are areas that need improvement.

The incremental query behavior I was confused about is explained well here: https://hudi.apache.org/docs/configurations/#cleanretain_commits

I think Hudi is still missing some functionality when it comes to Spark Structured Streaming:
- readStream from a given point in time,
- readStream up to a given point in time,
- maxBytesPerTrigger,
- maxRecordsPerTrigger.
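For reference, a bounded incremental read between two instants is already possible as a *batch* query via the Spark datasource options. The sketch below uses the option keys from the Hudi configuration docs; the instant-time values and the `read_incremental` helper are hypothetical placeholders, and you should verify the option names against your Hudi version:

```python
# Hedged sketch of a Hudi incremental batch query between two instants.
# Option keys are from the Hudi configuration reference; the instant-time
# values below are made-up examples (format yyyyMMddHHmmss).
incr_opts = {
    "hoodie.datasource.query.type": "incremental",
    # read commits strictly after this instant time
    "hoodie.datasource.read.begin.instanttime": "20211101000000",
    # optionally cap the range at a later instant time
    "hoodie.datasource.read.end.instanttime": "20211201000000",
}

def read_incremental(spark, base_path):
    """Run a bounded incremental batch read of a Hudi table.

    Requires a SparkSession with the Hudi bundle on the classpath;
    `base_path` is the table's base path on storage.
    """
    return spark.read.format("hudi").options(**incr_opts).load(base_path)
```

The point of the feature requests above is that `spark.readStream.format("hudi")` offers no equivalent begin/end instant bounds or per-trigger rate limits, so the streaming path can't express this kind of bounded or throttled read.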