Re: [I] use commit_time in the WHERE STATEMENT to optimize the incremental query [hudi]

via GitHub Thu, 14 May 2026 11:55:41 -0700


praneethkaturi commented on issue #14869:
URL: https://github.com/apache/hudi/issues/14869#issuecomment-4453786892


   Hi @xushiyan, thanks for letting me take this on. This is my first 
contribution to Hudi and I'm still finding my way around the codebase, so I'd 
really appreciate some guidance before I commit to a direction.
   
   I've spent the last few days reading the Spark read path (DefaultSource, 
IncrementalRelationV2, HoodieFileIndex), and I think I have a rough mental model 
of how snapshot and incremental queries are dispatched today. But when I 
re-read the ticket, I realized I'm not sure what behavior we actually want to 
ship. A few different things would all be consistent with the description:
   
   1. Auto-detect and rewrite. A normal SELECT * FROM trips WHERE 
_hoodie_commit_time > 'X' gets transparently turned into an incremental query 
under the hood.
   2. Just make the existing query faster. Same SQL stays a snapshot query, but 
Hudi uses the _hoodie_commit_time predicate to skip files that can't possibly 
match.
   3. SQL-native syntax for the existing options. Instead of WHERE-clause 
magic, introduce a hint or new syntax so users can say "this is an incremental 
query starting at X" directly in SQL.
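   
   To make option 2 concrete, here is a toy sketch of the kind of file 
skipping I have in mind. It is plain Python, not Hudi code: FileStats, 
prune_files and the per-file min/max commit times are hypothetical names I made 
up for illustration, and I'm assuming commit times are fixed-width timestamp 
strings so that string comparison matches chronological order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary; not an actual Hudi class."""
    path: str
    min_commit_time: str
    max_commit_time: str


def prune_files(files, lower_bound):
    """Keep files that may hold rows with _hoodie_commit_time > lower_bound.

    A file can be skipped only when even its newest row is not newer
    than the bound. Assumes fixed-width timestamp strings.
    """
    return [f for f in files if f.max_commit_time > lower_bound]


files = [
    FileStats("f1.parquet", "20240101000000", "20240102000000"),
    FileStats("f2.parquet", "20240102000000", "20240105000000"),
    FileStats("f3.parquet", "20240106000000", "20240106000000"),
]

# Equivalent of: WHERE _hoodie_commit_time > '20240103000000'
kept = prune_files(files, "20240103000000")
print([f.path for f in kept])
```

   The query result would be unchanged in this model; the row-level filter 
still has to run on the files that survive, since a kept file can contain older 
rows too.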
   
   These feel quite different to me, both in scope and in risk, and I don't 
have enough context to pick the right one on my own. A few questions to help me 
get unstuck:
   
   1. Engine scope. Should I focus on Spark for now, or is this expected to 
work for Flink too? The ticket mentions DeltaStreamer (Spark-only) but the 
title is general, so I wasn't sure.
   2. Which of the three interpretations matches what you had in mind? Or is 
the scope still open and you'd like me to propose one?
   3. Are there known traps, such as "don't change the meaning of a regular 
SELECT" or "must work with archived commits", that I should design around?
   
   **Any additional info regarding the scope would be helpful too!** Apologies 
for the long-winded comment; I'd rather over-communicate at the start than head 
off in the wrong direction and waste your time on review.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
