praneethkaturi commented on issue #14869: URL: https://github.com/apache/hudi/issues/14869#issuecomment-4453786892
Hi @xushiyan, thanks for letting me take this on. This is my first contribution to Hudi and I'm still finding my way around the codebase, so I'd really appreciate some guidance before I commit to a direction.

I've spent the last few days reading the Spark read path (`DefaultSource`, `IncrementalRelationV2`, `HoodieFileIndex`), and I think I have a rough mental model of how snapshot and incremental queries are dispatched today. But when I re-read the ticket, I realized I'm not sure what behavior we actually want to ship. A few different things would all be consistent with the description:

1. Auto-detect and rewrite. A normal `SELECT * FROM trips WHERE _hoodie_commit_time > 'X'` gets transparently turned into an incremental query under the hood.
2. Just make the existing query faster. Same SQL stays a snapshot query, but Hudi uses the `_hoodie_commit_time` predicate to skip files that can't possibly match.
3. SQL-native syntax for the existing options. Instead of WHERE-clause magic, introduce a hint or new syntax so users can say "this is an incremental query starting at X" directly in SQL.

These feel quite different to me both in scope and risk, and I don't have enough context to pick the right one on my own.

A few questions to help me get unstuck:

1. Engine scope. Should I focus on Spark for now, or is this expected to work for Flink too? The ticket mentions DeltaStreamer (Spark-only) but the title is general, so I wasn't sure.
2. Which of the three interpretations matches what you had in mind? Or is the scope still open and you'd like me to propose one?
3. Are there known traps, things like "don't change the meaning of a regular SELECT" or "must work with archived commits", that I should design around?

**Any additional info regarding the scope will be helpful too**!

Apologies for the long-winded comment, I'd rather over-communicate at the start than head off in the wrong direction and waste your time on review.

--
This is an automated message from the Apache Git Service.
