[ https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621398#comment-17621398 ]

Weston Pace commented on ARROW-18113:
-------------------------------------

On reflection, I don't really prefer my automagic suggestion.  I think an 
explicit multi-read API added to the filesystem would be a good way to go.  I 
don't see it as an extension of ReadAsync though.  Something like:

{noformat}
  /// \brief Request multiple reads at once
  ///
  /// The underlying filesystem may optimize these reads by coalescing small reads
  /// into large reads or by breaking up large reads into multiple parallel
  /// smaller reads.  The reads should be issued in parallel if it makes sense
  /// for the filesystem.
  ///
  /// One future will be returned for each input read range.  Multiple returned
  /// futures may correspond to a single read.  Or, a single returned future may
  /// be a combined result of several individual reads.
  ///
  /// \param[in] ranges The ranges to read
  /// \return One future per input range, each of which completes when the data
  /// from the requested range is available
  virtual std::vector<Future<std::shared_ptr<Buffer>>> ReadManyAsync(
      const IOContext&, const std::vector<ReadRange>& ranges);
{noformat}
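
For illustration, a caller might use it something like the following.  This is purely hypothetical: it assumes the method ends up on RandomAccessFile (whether it lives there or on the filesystem itself is still open), and a real caller would attach continuations rather than block on result().

{noformat}
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/future.h"

// Hypothetical usage of the proposed ReadManyAsync.
arrow::Status ReadSomeRanges(const std::shared_ptr<arrow::io::RandomAccessFile>& file,
                             const arrow::io::IOContext& ctx) {
  // Three scattered ranges; the filesystem is free to coalesce or split them.
  std::vector<arrow::io::ReadRange> ranges = {{0, 1024}, {4096, 512}, {1 << 20, 8192}};
  auto futures = file->ReadManyAsync(ctx, ranges);
  for (auto& fut : futures) {
    // Each future maps 1:1 to a requested range even if the underlying I/O
    // was coalesced or split.  (Blocking here just keeps the example short.)
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buffer, fut.result());
    // ... hand `buffer` to the decoder responsible for that range ...
  }
  return arrow::Status::OK();
}
{noformat}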

There could be a default implementation (perhaps relying on configurable 
protected min_hole_size_ and max_contiguous_read_size_ variables) so that 
filesystems would only need to provide a specialized alternative where it made 
sense.  In the future it would be interesting to benchmark and see if 
[preadv|https://linux.die.net/man/2/preadv] can be used to provide a more 
optimized version for the local filesystem.
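
To make the default-implementation idea a little more concrete, here's a rough sketch of just the coalescing step.  hole_size_limit / range_size_limit stand in for the min_hole_size_ / max_contiguous_read_size_ members mentioned above, and the real default would presumably reuse the coalescing logic that already backs ReadRangeCache rather than reimplement it:

{noformat}
#include <algorithm>
#include <cstdint>
#include <vector>

#include "arrow/io/interfaces.h"  // arrow::io::ReadRange

// Merge sorted ranges whose gap is below hole_size_limit, as long as the
// merged read stays under range_size_limit.  Splitting oversized individual
// reads is left out of this sketch.
std::vector<arrow::io::ReadRange> CoalesceRanges(
    std::vector<arrow::io::ReadRange> ranges, int64_t hole_size_limit,
    int64_t range_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const arrow::io::ReadRange& a, const arrow::io::ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<arrow::io::ReadRange> coalesced;
  for (const auto& range : ranges) {
    if (!coalesced.empty()) {
      auto& last = coalesced.back();
      const int64_t gap = range.offset - (last.offset + last.length);
      const int64_t merged_length = (range.offset + range.length) - last.offset;
      if (gap <= hole_size_limit && merged_length <= range_size_limit) {
        last.length = std::max(last.length, merged_length);
        continue;
      }
    }
    coalesced.push_back(range);
  }
  return coalesced;
}
{noformat}

The default would then issue one ReadAsync per coalesced range and slice each original range's result out of the combined buffer, which is essentially what ReadRangeCache does today minus the caching.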

I'd also be curious to know how an API like this could be adapted (or whether 
my proposal already fits) for something like io_uring.  [~sakras]

> Implement a read range process without caching
> ----------------------------------------------
>
>                 Key: ARROW-18113
>                 URL: https://issues.apache.org/jira/browse/ARROW-18113
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  mixes caching with coalescing, which makes it difficult to implement readers 
> that can truly perform concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal of this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing part 
> instead).  Once we have that new capability, we can port the Parquet and 
> IPC readers to this new component and keep improving the reading process 
> (that would be part of a separate set of follow-up tickets).  Similar ideas 
> were mentioned in https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this new capability might be inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where 
> the abstract file system can provide a default implementation.



