On 06/02/2020 at 17:07, Wes McKinney wrote:
> In case folks are interested in how some other systems deal with IO
> management / scheduling, the comments in
>
> https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h
>
> and related files might be interesting
Thanks. There's quite a lot of functionality. It would be useful to
discuss which parts of that functionality are desirable, and which are
not. For example, I don't think we should spend development time
writing a complex IO scheduler (using which heuristics?) like Impala
has, but that's my opinion :-)

Regards

Antoine.

> On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>>>
>>> On Wed, 5 Feb 2020 15:46:15 -0600
>>> Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> I'll comment in more detail on some of the other items in due course,
>>>> but I think this should be handled by an implementation of
>>>> RandomAccessFile (that wraps a naked RandomAccessFile) with some
>>>> additional methods, rather than adding this to the abstract
>>>> RandomAccessFile interface, e.g.
>>>>
>>>> class CachingInputFile : public RandomAccessFile {
>>>>  public:
>>>>   CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
>>>>   Status CacheRanges(...);
>>>> };
>>>>
>>>> etc.
>>>
>>> IMHO it may be more beneficial to expose it as an asynchronous API on
>>> RandomAccessFile, for example:
>>>
>>> class RandomAccessFile {
>>>  public:
>>>   struct Range {
>>>     int64_t offset;
>>>     int64_t length;
>>>   };
>>>
>>>   std::vector<Promise<std::shared_ptr<Buffer>>>
>>>   ReadRangesAsync(std::vector<Range> ranges);
>>> };
>>>
>>> The reason is that some APIs such as the C++ AWS S3 API have their own
>>> async support, which may be beneficial to use over a generic Arrow
>>> thread-pool implementation.
>>>
>>> Also, by returning a Promise instead of simply caching the results, you
>>> make it easier to handle the lifetime of the results.
>>
>> This seems useful, too. It becomes a question of where you want to
>> manage the cached memory segments, however you obtain them. I'm
>> arguing that we should not have much custom code in the Parquet
>> library to manage the prefetched segments (and provide the correct
>> buffer slice to each column reader when they need it), and instead
>> encapsulate this logic so it can be reused.
>>
>> The API I proposed was just a mockup. I agree it would make sense for
>> the prefetching to occur asynchronously so that a column reader can
>> proceed as soon as its coalesced chunk has been prefetched, rather
>> than having to wait synchronously for all prefetching to complete.
>>
>>> (Promise<T> can be something like std::future<Result<T>>, though
>>> std::future<> has annoying limitations and we may want to write our own
>>> instead)
>>>
>>> Regards
>>>
>>> Antoine.
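
For concreteness, here is a minimal sketch of the async range-read idea
discussed above, using std::future as a stand-in for the proposed
Promise type. ReadRangesAsync is written as a free function for
illustration (it is not part of the Arrow API, where it would instead
be a method on RandomAccessFile), the std::async dispatch is only a
placeholder for whatever backend performs the IO, and it assumes the
Result-returning ReadAt overload:

#include <cstdint>
#include <future>
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"

struct Range {
  int64_t offset;
  int64_t length;
};

// Hypothetical free function, not part of the Arrow API: issues one
// asynchronous read per range against an existing RandomAccessFile
// and returns one future per range.
std::vector<std::future<arrow::Result<std::shared_ptr<arrow::Buffer>>>>
ReadRangesAsync(std::shared_ptr<arrow::io::RandomAccessFile> file,
                std::vector<Range> ranges) {
  std::vector<std::future<arrow::Result<std::shared_ptr<arrow::Buffer>>>>
      futures;
  futures.reserve(ranges.size());
  for (const Range& range : ranges) {
    // std::async is only a placeholder for dispatch; an S3-backed file
    // could instead fulfil the future from the SDK's own async callbacks,
    // per Antoine's point about reusing native async support.
    futures.push_back(std::async(std::launch::async, [file, range] {
      return file->ReadAt(range.offset, range.length);
    }));
  }
  return futures;
}

With this shape, a column reader blocks only on the future for its own
coalesced chunk rather than waiting for all prefetching to complete,
and the lifetime of each result is tied to the future that owns it.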