[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621398#comment-17621398 ]
Weston Pace commented on ARROW-18113:
-------------------------------------

On reflection, I don't really prefer my automagic suggestion. I think an explicit multi-read API added to the filesystem would be a good way to go. I don't see it as an extension of ReadAsync, though. Something like:

{noformat}
/// \brief Request multiple reads at once
///
/// The underlying filesystem may optimize these reads by coalescing small reads into
/// large reads or by breaking up large reads into multiple parallel smaller reads. The
/// reads should be issued in parallel if it makes sense for the filesystem.
///
/// One future will be returned for each input read range. Multiple returned futures
/// may correspond to a single read. Or, a single returned future may be a combined
/// result of several individual reads.
///
/// \param[in] ranges The ranges to read
/// \return One future per input range, each completing when the data for that range
///         is available
virtual std::vector<Future<std::shared_ptr<Buffer>>> ReadManyAsync(
    const IOContext&, const std::vector<ReadRange>& ranges);
{noformat}

There could be a default implementation (perhaps relying on configurable protected min_hole_size_ and max_contiguous_read_size_ variables) so that filesystems would only need to provide a specialized alternative where it made sense (a rough sketch of such a default follows the quoted issue description below).

In the future it would be interesting to benchmark and see if [preadv|https://linux.die.net/man/2/preadv] can be used to provide a more optimized version for the local filesystem (also sketched below). I'd also be curious to know how an API like this could be adapted (or whether my proposal fits) for something like io_uring [~sakras]

> Implement a read range process without caching
> ----------------------------------------------
>
>                 Key: ARROW-18113
>                 URL: https://issues.apache.org/jira/browse/ARROW-18113
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>
> The current [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100] mixes caching with coalescing, which makes it difficult to implement readers that can truly perform concurrent reads on coalesced data (see this [github comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for additional context); for instance, right now the prebuffering feature of those readers cannot handle concurrent invocations.
>
> The goal of this ticket is to implement a component similar to ReadRangeCache that performs non-cached reads (doing only the coalescing part). Once we have that new capability, we can port the Parquet and IPC readers to it and keep improving the reading process (that would be part of a separate set of follow-up tickets). Similar ideas were mentioned in https://issues.apache.org/jira/browse/ARROW-17599
>
> Maybe a good place to implement this new capability is inside the filesystem abstraction (as part of a dedicated method to read coalesced data), where the abstract filesystem can provide a default implementation.
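For illustration only, here is a rough sketch (not existing Arrow code) of the kind of default coalescing logic such a ReadManyAsync could fall back to. It is written as a free helper over the existing RandomAccessFile::ReadAsync and arrow::SliceBuffer so it stands alone; ReadManyAsyncDefault, hole_size_limit, and range_size_limit are made-up names standing in for the hypothetical min_hole_size_ / max_contiguous_read_size_ members, and the input ranges are assumed to be sorted and non-overlapping.

{noformat}
// Rough sketch only: a possible default coalescing strategy for a
// ReadManyAsync-style API, written as a free helper over the existing
// RandomAccessFile::ReadAsync. hole_size_limit / range_size_limit play the
// role of the hypothetical min_hole_size_ / max_contiguous_read_size_ knobs.
#include <cstdint>
#include <memory>
#include <vector>

#include "arrow/buffer.h"         // arrow::Buffer, arrow::SliceBuffer
#include "arrow/io/interfaces.h"  // arrow::io::RandomAccessFile, ReadRange, IOContext
#include "arrow/util/future.h"    // arrow::Future

using arrow::Buffer;
using arrow::Future;
using arrow::SliceBuffer;
using arrow::io::IOContext;
using arrow::io::RandomAccessFile;
using arrow::io::ReadRange;

// Assumes `ranges` is sorted by offset and non-overlapping.
std::vector<Future<std::shared_ptr<Buffer>>> ReadManyAsyncDefault(
    RandomAccessFile& file, const IOContext& ctx,
    const std::vector<ReadRange>& ranges, int64_t hole_size_limit,
    int64_t range_size_limit) {
  std::vector<Future<std::shared_ptr<Buffer>>> out(ranges.size());
  size_t i = 0;
  while (i < ranges.size()) {
    // Greedily merge neighbors whose gap is at most hole_size_limit, as long
    // as the combined read stays within range_size_limit.
    const int64_t start = ranges[i].offset;
    int64_t end = start + ranges[i].length;
    size_t j = i + 1;
    while (j < ranges.size() && ranges[j].offset - end <= hole_size_limit &&
           (ranges[j].offset + ranges[j].length) - start <= range_size_limit) {
      end = ranges[j].offset + ranges[j].length;
      ++j;
    }
    // One physical read for the coalesced span; each logical range gets a
    // future that slices its own piece out of the combined buffer.
    auto combined = file.ReadAsync(ctx, start, end - start);
    for (size_t k = i; k < j; ++k) {
      const ReadRange range = ranges[k];
      out[k] = combined.Then([range, start](const std::shared_ptr<Buffer>& buf) {
        return SliceBuffer(buf, range.offset - start, range.length);
      });
    }
    i = j;
  }
  return out;
}
{noformat}

Each logical range gets its own future that slices its piece out of the coalesced buffer, so callers can still await individual ranges even when several of them share one physical read.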
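The preadv idea could look roughly like the following Linux-only, hypothetical sketch (ReadRangesWithPreadv and the plain Range struct are invented for the example): one vectored read per coalesced span, scattering bytes directly into pre-allocated per-range buffers, with scratch buffers absorbing the holes in between. A production version would also need to split the iovec array at IOV_MAX and decide how to parallelize across spans.

{noformat}
// Rough, Linux-only sketch (hypothetical, not Arrow code) of the preadv(2)
// idea: one vectored read per coalesced span, scattering bytes straight into
// pre-allocated per-range buffers, with scratch buffers soaking up the holes
// between ranges.
#include <sys/uio.h>  // preadv, struct iovec
#include <cerrno>
#include <cstdint>
#include <vector>

struct Range {
  int64_t offset;
  int64_t length;
};

// Reads every `ranges[i]` from `fd` into `dst[i]` (pre-sized to ranges[i].length).
// Ranges must be sorted by offset and non-overlapping. Returns false on I/O
// error or if EOF is hit inside the span.
bool ReadRangesWithPreadv(int fd, const std::vector<Range>& ranges,
                          std::vector<std::vector<uint8_t>>& dst) {
  if (ranges.empty()) return true;
  std::vector<std::vector<uint8_t>> holes;  // scratch storage for gap bytes
  holes.reserve(ranges.size());             // keeps iov_base pointers stable
  std::vector<struct iovec> iov;
  int64_t pos = ranges.front().offset;
  int64_t total = 0;
  for (size_t i = 0; i < ranges.size(); ++i) {
    const int64_t gap = ranges[i].offset - pos;
    if (gap > 0) {
      // Hole bytes are still transferred by the contiguous read; discard them here.
      holes.emplace_back(static_cast<size_t>(gap));
      iov.push_back({holes.back().data(), static_cast<size_t>(gap)});
      total += gap;
    }
    iov.push_back({dst[i].data(), static_cast<size_t>(ranges[i].length)});
    total += ranges[i].length;
    pos = ranges[i].offset + ranges[i].length;
  }
  // preadv fills the iovecs in order from a single file offset; loop to cope
  // with partial reads.
  int64_t offset = ranges.front().offset;
  int64_t remaining = total;
  size_t idx = 0;
  while (remaining > 0) {
    const ssize_t n = preadv(fd, iov.data() + idx,
                             static_cast<int>(iov.size() - idx), offset);
    if (n < 0) {
      if (errno == EINTR) continue;
      return false;
    }
    if (n == 0) return false;  // EOF before the span was fully read
    offset += n;
    remaining -= n;
    // Skip fully filled iovecs and trim a partially filled one.
    size_t left = static_cast<size_t>(n);
    while (left > 0 && left >= iov[idx].iov_len) {
      left -= iov[idx].iov_len;
      ++idx;
    }
    if (left > 0) {
      iov[idx].iov_base = static_cast<uint8_t*>(iov[idx].iov_base) + left;
      iov[idx].iov_len -= left;
    }
  }
  return true;
}
{noformat}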