[
https://issues.apache.org/jira/browse/HUDI-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445162#comment-17445162
]
Vinoth Chandar edited comment on HUDI-2750 at 11/17/21, 1:32 PM:
-----------------------------------------------------------------
+1 on this. Dumping my thoughts here. When the start commit is far away, 2/3
can be more performant, since they already filter out the files that have
already been cleaned, etc. Reading the entire timeline archive log can be time
consuming.
I think we can index the timeline as well and support efficient range
retrievals. But I am wondering why you think 2/3 are only suitable for full
history reads. Is it because the log files don't carry the delta commit instant
in their names today? With the instant in the file names (at least on object
storage), we could figure out which files changed between any given interval,
right?
Is this the gap?
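To make that concrete, here is a minimal sketch (not an existing Hudi API; the naming scheme and helper are assumptions) of how the changed base files could be derived from a plain listing, once the commit instant is embedded in each file name:
{code:java}
// Hypothetical sketch: given a listing of base file names that embed the commit
// instant (assumed scheme: <fileId>_<writeToken>_<instantTime>.parquet), keep only
// the files written in the half-open instant interval (start, end]. Hudi instant
// times are fixed-width timestamps, so lexicographic order matches chronological order.
import java.util.List;
import java.util.stream.Collectors;

public class IntervalFileFilter {

  /** Extracts the instant time from a base file name such as
   *  "fid-0001_1-0-1_20211117093045.parquet" (assumed naming scheme). */
  static String instantOf(String baseFileName) {
    String noExt = baseFileName.substring(0, baseFileName.lastIndexOf('.'));
    return noExt.substring(noExt.lastIndexOf('_') + 1);
  }

  /** Keeps files whose embedded instant falls in (startInstant, endInstant]. */
  static List<String> changedBetween(List<String> baseFiles,
                                     String startInstant, String endInstant) {
    return baseFiles.stream()
        .filter(f -> {
          String ts = instantOf(f);
          return ts.compareTo(startInstant) > 0 && ts.compareTo(endInstant) <= 0;
        })
        .collect(Collectors.toList());
  }
}
{code}
The same trick does not work for log files today, because their names carry the base instant rather than the delta commit instant, which is exactly the gap asked about above.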
> Improve the incremental data files metadata more efficiently for streaming
> source
> ---------------------------------------------------------------------------------
>
> Key: HUDI-2750
> URL: https://issues.apache.org/jira/browse/HUDI-2750
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Common Core
> Reporter: Danny Chen
> Priority: Major
> Fix For: 0.11.0
>
>
> There are currently 3 ways to fetch the incremental data files for a streaming
> read (a sketch of way 1 follows this list):
> 1. Read the incremental commit metadata and resolve the data files to
> construct the incremental filesystem view.
> 2. Scan the filesystem directly and filter the data files by start commit
> time, if consumption starts from the 'earliest' offset.
> 3. A more efficient variant of 2: look up the metadata table, if it is enabled.
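> As a reference point, a minimal sketch of way 1, assuming the 0.10-era timeline
> APIs (the method and variable names beyond the Hudi classes are illustrative only):
> {code:java}
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
> import java.util.stream.Collectors;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hudi.common.model.HoodieCommitMetadata;
> import org.apache.hudi.common.model.HoodieWriteStat;
> import org.apache.hudi.common.table.HoodieTableMetaClient;
> import org.apache.hudi.common.table.timeline.HoodieInstant;
> import org.apache.hudi.common.table.timeline.HoodieTimeline;
>
> public class IncrementalCommitMetadataScan {
>
>   /** Resolves the data files touched by completed commits after startCommit. */
>   static Set<String> filesTouchedAfter(Configuration hadoopConf, String basePath,
>                                        String startCommit) throws Exception {
>     HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
>         .setConf(hadoopConf)
>         .setBasePath(basePath)
>         .build();
>
>     // Only completed commit/deltacommit instants newer than startCommit.
>     HoodieTimeline timeline = metaClient.getActiveTimeline()
>         .getCommitsTimeline()
>         .filterCompletedInstants()
>         .findInstantsAfter(startCommit, Integer.MAX_VALUE);
>
>     Set<String> touchedFiles = new HashSet<>();
>     for (HoodieInstant instant : timeline.getInstants().collect(Collectors.toList())) {
>       HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
>           timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
>       metadata.getPartitionToWriteStats().values().stream()
>           .flatMap(List::stream)
>           .map(HoodieWriteStat::getPath)   // path relative to the table base path
>           .forEach(touchedFiles::add);
>     }
>     // This only works while the instants are still on the active timeline; once
>     // they are archived, the archived timeline has to be replayed instead.
>     return touchedFiles;
>   }
> }
> {code}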
> However, these 3 ways are far from enough for production:
> For 1: there is a bottleneck when the start commit time is far in the past and
> the instants may already have been archived; loading those archived metadata
> files takes too much time (more than 30 minutes in our production), which is
> unacceptable.
> For 2 & 3: they are only suitable for the case where the incremental data set
> is the full history (consuming from the 'earliest' offset).
> We had better propose a way to look up the incremental data files between
> arbitrary instants, so that the filesystem view can be constructed efficiently
> (see the sketch below).
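> A hypothetical shape for that lookup, e.g. backed by the metadata table or an
> indexed (active + archived) timeline with efficient range retrievals; nothing
> like this exists in Hudi today, and the interface and names are purely illustrative:
> {code:java}
> import java.util.Set;
>
> /** Hypothetical files-in-range lookup for a streaming source, independent of
>  *  whether the instants in the interval are still active or already archived. */
> public interface IncrementalFilesLookup {
>
>   /** Relative paths of data files written or updated in (startInstant, endInstant]. */
>   Set<String> filesInRange(String startInstant, String endInstant);
> }
> {code}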
--
This message was sent by Atlassian Jira
(v8.20.1#820001)