[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621427#comment-17621427 ]
Weston Pace commented on ARROW-18113:
-------------------------------------

> Just to be clear: to the filesystem, or on the reader itself?

Oops, I mean on {{RandomAccessFile}}.

> Also, I'm not clear on: "Multiple returned futures may correspond to a single
> read. Or, a single returned future may be a combined result of several
> individual reads." Isn't this saying the same thing twice?

I might call {{file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})}}. The filesystem could then implement this as:

{noformat}
std::vector<Future> futures;

// The first two futures correspond to the same read
Future<Buffer> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));

// The third future corresponds to two reads
Future<Buffer> part_one = ReadAsync(1024, 8Mi);
Future<Buffer> part_two = ReadAsync(1024+8Mi, 8Mi-1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs => Concatenate(bufs)));
{noformat}

> Implement a read range process without caching
> ----------------------------------------------
>
>                 Key: ARROW-18113
>                 URL: https://issues.apache.org/jira/browse/ARROW-18113
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>
> The current
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
> mixes caching with coalescing, which makes it difficult to implement readers
> that can truly perform concurrent reads on coalesced data (see this
> [github
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for
> additional context); for instance, right now the prebuffering feature of
> those readers cannot handle concurrent invocations.
> The goal of this ticket is to implement a component similar to
> ReadRangeCache that performs non-cached reads (doing only the coalescing
> part). Once we have that capability, we can port the Parquet and IPC readers
> to the new component and keep improving the reading process (that would be
> part of another set of follow-up tickets). Similar ideas were mentioned in
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability may be inside the file system
> abstraction (as a dedicated method for reading coalesced data), where the
> abstract file system can provide a default implementation.
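
For readers who want something compilable, the coalescing-without-caching idea discussed above can be sketched in standard C++ roughly as follows. This is only an illustration under stated assumptions, not Arrow's API: {{MemFile}}, {{Range}}, the {{ReadMany}} signature, and the {{hole_limit}} and {{max_read}} parameters are all hypothetical names. Ranges are treated as half-open {start, end} pairs, which is how the numbers in the comment's example appear to line up. {{std::shared_future}} with deferred launch stands in for an asynchronous future, so no real concurrency happens here.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical half-open byte range [start, end).
struct Range {
  std::size_t start;
  std::size_t end;
};

// Hypothetical stand-in for a random access file, backed by memory.
// It counts physical reads so the effect of coalescing is observable.
struct MemFile {
  std::string data;
  int num_reads = 0;
  std::string Read(std::size_t offset, std::size_t length) {
    ++num_reads;
    return data.substr(offset, length);
  }
};

// Returns one future per requested range, in request order.
// Assumes `ranges` is sorted by start, `max_read` > 0, and `file` outlives
// the returned futures. Ranges separated by at most `hole_limit` bytes share
// one read; a group longer than `max_read` is read in several chunks.
std::vector<std::shared_future<std::string>> ReadMany(
    MemFile& file, const std::vector<Range>& ranges,
    std::size_t hole_limit, std::size_t max_read) {
  std::vector<std::shared_future<std::string>> out;
  std::size_t i = 0;
  while (i < ranges.size()) {
    // Grow a group [lo, hi) while the next range starts close enough to hi.
    std::size_t lo = ranges[i].start;
    std::size_t hi = ranges[i].end;
    std::size_t j = i + 1;
    while (j < ranges.size() && ranges[j].start <= hi + hole_limit) {
      hi = std::max(hi, ranges[j].end);
      ++j;
    }
    // One shared future per group. It is deferred, so the underlying reads
    // run the first time any dependent future is waited on, and only once.
    std::shared_future<std::string> group =
        std::async(std::launch::deferred, [&file, lo, hi, max_read] {
          std::string buf;
          for (std::size_t off = lo; off < hi; off += max_read) {
            buf += file.Read(off, std::min(max_read, hi - off));
          }
          return buf;
        }).share();
    // Each requested range becomes a slice of its group's buffer.
    for (; i < j; ++i) {
      const Range r = ranges[i];
      out.push_back(std::async(std::launch::deferred, [group, r, lo] {
        return group.get().substr(r.start - lo, r.end - r.start);
      }).share());
    }
  }
  return out;
}
```

Calling this with ranges {0, 3}, {3, 8}, {20, 36}, a {{hole_limit}} of 0 and a {{max_read}} of 8 mirrors the shape of the example in the comment at a smaller scale: the first two futures are slices of a single 8-byte read, and the third is the concatenation of two 8-byte reads. Nothing is retained once the futures are dropped, which is the "coalescing only" behaviour the ticket asks for.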