On 06/02/2020 at 17:07, Wes McKinney wrote:
> In case folks are interested in how some other systems deal with IO
> management / scheduling, the comments in
>
> https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h
>
> and related files might be interesting
Thanks. There's quite a lot of functionality. It would be useful to
discuss which parts of that functionality are desirable, and which are
not. For example, I don't think we should spend development time
writing a complex IO scheduler (using which heuristics?) like Impala
has, but that's my opinion :-)

Regards

Antoine.

> On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>>>
>>> On Wed, 5 Feb 2020 15:46:15 -0600
>>> Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> I'll comment in more detail on some of the other items in due course,
>>>> but I think this should be handled by an implementation of
>>>> RandomAccessFile (that wraps a naked RandomAccessFile) with some
>>>> additional methods, rather than adding this to the abstract
>>>> RandomAccessFile interface, e.g.
>>>>
>>>> class CachingInputFile : public RandomAccessFile {
>>>>  public:
>>>>   CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
>>>>   Status CacheRanges(...);
>>>> };
>>>>
>>>> etc.
>>>
>>> IMHO it may be more beneficial to expose it as an asynchronous API on
>>> RandomAccessFile, for example:
>>>
>>> class RandomAccessFile {
>>>  public:
>>>   struct Range {
>>>     int64_t offset;
>>>     int64_t length;
>>>   };
>>>
>>>   std::vector<Promise<std::shared_ptr<Buffer>>>
>>>   ReadRangesAsync(std::vector<Range> ranges);
>>> };
>>>
>>> The reason is that some APIs such as the C++ AWS S3 API have their own
>>> async support, which may be beneficial to use over a generic Arrow
>>> thread-pool implementation.
>>>
>>> Also, by returning a Promise instead of simply caching the results, you
>>> make it easier to handle the lifetime of the results.
>>
>> This seems useful, too. It becomes a question of where you want to
>> manage the cached memory segments, however you obtain them. I'm
>> arguing that we should not have much custom code in the Parquet
>> library to manage the prefetched segments (and provide the correct
>> buffer slice to each column reader when they need it), and instead
>> encapsulate this logic so it can be reused.
>>
>> The API I proposed was just a mockup. I agree it would make sense for
>> the prefetching to occur asynchronously so that a column reader can
>> proceed as soon as its coalesced chunk has been prefetched, rather
>> than having to wait synchronously for all prefetching to complete.
>>
>>> (Promise<T> can be something like std::future<Result<T>>, though
>>> std::future<> has annoying limitations and we may want to write our own
>>> instead)
>>>
>>> Regards
>>>
>>> Antoine.
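
For concreteness, here is a minimal sketch of the async range-read idea
discussed above, using std::future as a stand-in for the proposed
Promise type. ReadRangesAsync is written as a free function for
illustration (it is not part of the Arrow API, where it would instead
be a method on RandomAccessFile), the std::async dispatch is only a
placeholder for whatever backend performs the IO, and it assumes the
Result-returning ReadAt overload:

#include <cstdint>
#include <future>
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"

struct Range {
  int64_t offset;
  int64_t length;
};

// Hypothetical free function, not part of the Arrow API: issues one
// asynchronous read per range against an existing RandomAccessFile
// and returns one future per range.
std::vector<std::future<arrow::Result<std::shared_ptr<arrow::Buffer>>>>
ReadRangesAsync(std::shared_ptr<arrow::io::RandomAccessFile> file,
                std::vector<Range> ranges) {
  std::vector<std::future<arrow::Result<std::shared_ptr<arrow::Buffer>>>>
      futures;
  futures.reserve(ranges.size());
  for (const Range& range : ranges) {
    // std::async is only a placeholder for dispatch; an S3-backed file
    // could instead fulfil the future from the SDK's own async callbacks,
    // per Antoine's point about reusing native async support.
    futures.push_back(std::async(std::launch::async, [file, range] {
      return file->ReadAt(range.offset, range.length);
    }));
  }
  return futures;
}

With this shape, a column reader blocks only on the future for its own
coalesced chunk rather than waiting for all prefetching to complete,
and the lifetime of each result is tied to the future that owns it.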