Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-05-01 Thread David Li
(Apologies for the double-email) In the original coalescing PR, an "AsyncContext" abstraction was discussed. I could imagine being able to hold arbitrary attributes/metrics for tasks that a scheduler could then take into account, while making it easier for applications to thread through all the di…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-05-01 Thread David Li
Wes, From the outline there it seems like a path forward. I think the mode that would be helpful here is some sort of simple ordering/dependency so that the scheduler knows not to schedule subtasks of B until all subtasks of A have started (but not necessarily finished). I think the other part w…
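
That weaker ordering (B's subtasks may not start until all of A's have started, though A's need not have finished) falls out naturally from in-order submission to a FIFO work queue. A minimal sketch under that assumption; nothing here is Arrow code, and `submit_groups_in_order` is a hypothetical helper:

```python
from concurrent.futures import ThreadPoolExecutor

def submit_groups_in_order(pool, task_groups):
    """Submit every subtask of group i before any subtask of group i+1.
    Because the pool's work queue is FIFO, all of group A's subtasks
    *start* before any of group B's, but they may finish out of order."""
    futures = []
    for group in task_groups:
        for task in group:
            futures.append(pool.submit(task))
    return futures

order = []
a_tasks = [lambda i=i: order.append(("A", i)) for i in range(3)]
b_tasks = [lambda i=i: order.append(("B", i)) for i in range(2)]
# A single worker makes the interleaving deterministic for the demo.
with ThreadPoolExecutor(max_workers=1) as pool:
    for f in submit_groups_in_order(pool, [a_tasks, b_tasks]):
        f.result()
assert order == [("A", 0), ("A", 1), ("A", 2), ("B", 0), ("B", 1)]
```

With more workers the start-order guarantee still holds per the FIFO queue, but completion order is no longer deterministic.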

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-05-01 Thread Wes McKinney
I just wrote up a ticket about a general-purpose multi-consumer scheduler API; do you think this could be the beginning of a resolution? https://issues.apache.org/jira/browse/ARROW-8667 We may also want to design in some affordances so that no consumer is ever 100% blocked, even if that causes te…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Francois, Thanks for the pointers. I'll see if I can put together a proof-of-concept; might that help discussion? I agree it would be good to make it format-agnostic. I'm also curious what thoughts you'd have on how to manage cross-file parallelism (coalescing only helps within a file). If we just…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Antoine Pitrou
If we want to discuss IO APIs we should do that comprehensively. There are various ways of expressing what we want to do (explicit readahead, fadvise-like APIs, async APIs, etc.). Regards Antoine. On 30/04/2020 at 15:08, Francois Saint-Jacques wrote: > One more point, > > It would seem ben…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
One more point, It would seem beneficial if we could express this in a `RandomAccessFile::ReadAhead(vector)` method: no async buffering/coalescing would be needed. In the case of Parquet, we'd get the _exact_ ranges computed from the metadata. This method would also possibly benefit other filesystems s…
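
No `ReadAhead` method exists in this form; as a hedged sketch of the idea, a wrapper file could accept the exact (offset, length) ranges up front (e.g. computed from Parquet metadata) and serve later reads of those ranges from prefetched buffers. `ReadAheadFile` and its methods are invented for illustration, not Arrow's API:

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor

class ReadAheadFile:
    """Toy wrapper around a seekable binary file: read_ahead() fetches
    the given (offset, length) ranges eagerly on a thread pool so that
    later reads of those exact ranges are served from memory."""

    def __init__(self, raw, max_workers=4):
        self.raw = raw
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.lock = threading.Lock()
        self.cache = {}  # (offset, length) -> Future[bytes]

    def _fetch(self, offset, length):
        with self.lock:  # serialize seek+read on the shared handle
            self.raw.seek(offset)
            return self.raw.read(length)

    def read_ahead(self, ranges):
        for offset, length in ranges:
            self.cache[(offset, length)] = self.pool.submit(
                self._fetch, offset, length)

    def read_at(self, offset, length):
        fut = self.cache.get((offset, length))
        return fut.result() if fut is not None else self._fetch(offset, length)

f = ReadAheadFile(io.BytesIO(b"0123456789" * 10))
f.read_ahead([(0, 4), (50, 5)])    # exact ranges, e.g. from metadata
assert f.read_at(0, 4) == b"0123"  # served from the prefetched buffer
```

A real implementation would also need cache eviction and partial-overlap handling, which this sketch deliberately omits.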

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
Hello David, I think that what you ask is achievable with the dataset API without much effort. You'd have to insert the pre-buffering at ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is essentially a generator that looks like flatmap(Iterator>). It consumes the fragment in-order…
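
The flatmap-generator shape described above can be sketched as follows; the fragment values and `scan_fragment` are illustrative stand-ins, not the actual dataset API:

```python
def flatmap(fn, iterable):
    """Apply fn to each element and chain the resulting iterators."""
    for item in iterable:
        yield from fn(item)

# Hypothetical stand-ins: each "fragment" produces its record batches.
fragments = [["A-batch0", "A-batch1"], ["B-batch0"]]

def scan_fragment(fragment):
    # A real ScanFile would open the file here; this is the point
    # where pre-buffering (issuing the coalesced reads) would go,
    # before any batch is yielded.
    yield from fragment

batches = list(flatmap(scan_fragment, fragments))
assert batches == ["A-batch0", "A-batch1", "B-batch0"]
```

Because the generator pulls fragments lazily and in order, batch order follows fragment order.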

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Sure, and we are still interested in collaborating. The main use case we have is scanning datasets in order of the partition key; it seems ordering is the only missing thing from Antoine's comments. However, from briefly playing around with the Python API, an application could manually order the fragments…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Joris Van den Bossche
On Thu, 30 Apr 2020 at 04:06, Wes McKinney wrote: > On Wed, Apr 29, 2020 at 6:54 PM David Li wrote: > > > > Ah, sorry, so I am being somewhat unclear here. Yes, you aren't > > guaranteed to download all the files in order, but with more control, > > you can make this more likely. You can also pr…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread Wes McKinney
On Wed, Apr 29, 2020 at 6:54 PM David Li wrote: > > Ah, sorry, so I am being somewhat unclear here. Yes, you aren't > guaranteed to download all the files in order, but with more control, > you can make this more likely. You can also prevent the case where due > to scheduling, file N+1 doesn't even…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread David Li
Ah, sorry, so I am being somewhat unclear here. Yes, you aren't guaranteed to download all the files in order, but with more control, you can make this more likely. You can also prevent the case where due to scheduling, file N+1 doesn't even start downloading until after file N+2, which can happen…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread Antoine Pitrou
On 29/04/2020 at 23:30, David Li wrote: > Sure - > > The use case is to read a large partitioned dataset, consisting of > tens or hundreds of Parquet files. A reader expects to scan through > the data in order of the partition key. However, to improve > performance, we'd like to begin loading…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread David Li
Sure - The use case is to read a large partitioned dataset, consisting of tens or hundreds of Parquet files. A reader expects to scan through the data in order of the partition key. However, to improve performance, we'd like to begin loading files N+1, N+2, ... N + k while the consumer is still reading…
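
One way to sketch that ahead-of-the-consumer pattern, assuming a plain thread pool and a user-supplied `load` function (this is a toy, not the Datasets implementation):

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def scan_in_order(files, load, k=2):
    """Yield load(f) for each file, in order, while keeping up to k
    downloads in flight ahead of the consumer."""
    pool = ThreadPoolExecutor(max_workers=k)
    pending = deque()
    it = iter(files)
    try:
        # Prime the pipeline with the first k files.
        for f in it:
            pending.append(pool.submit(load, f))
            if len(pending) >= k:
                break
        while pending:
            result = pending.popleft().result()  # block only on the head
            nxt = next(it, None)
            if nxt is not None:                  # keep k reads in flight
                pending.append(pool.submit(load, nxt))
            yield result
    finally:
        pool.shutdown(wait=False)

# Order is preserved even if file N+1 finishes before file N.
out = list(scan_in_order(["a", "b", "c", "d"], lambda name: name.upper(), k=2))
assert out == ["A", "B", "C", "D"]
```

The consumer only ever blocks on the head of the queue, so slow file N never prevents N+1..N+k from downloading in the background.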

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread Antoine Pitrou
On 29/04/2020 at 20:49, David Li wrote: > > However, we noticed this doesn’t actually bring us the expected > benefits. Consider files A, B, and C being buffered in parallel; right > now, all I/O goes through an internal I/O pool, and so several > operations for each of the three files get added…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-29 Thread David Li
Hi all, I’d like to follow up on this discussion. Thanks to Antoine, we now have a read coalescing implementation in-tree which shows clear performance benefits both when reading plain files and Parquet files[1]. We now have some follow-up work where we think the design and implementation could be…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-23 Thread David Li
Thanks. I've set up an AWS account for my own testing for now. I've also submitted a PR to add a basic benchmark which can be run self-contained, against a local Minio instance, or against S3: https://github.com/apache/arrow/pull/6675 I ran the benchmark from my local machine, and I can test from…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-22 Thread Wes McKinney
On Thu, Mar 19, 2020 at 10:04 AM David Li wrote: > > > That's why it's important that we set ourselves up to do performance > > testing in a realistic environment in AWS rather than simulating it. > > For my clarification, what are the plans for this (if any)? I couldn't > find any prior discussion…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-19 Thread David Li
> That's why it's important that we set ourselves up to do performance testing > in a realistic environment in AWS rather than simulating it. For my clarification, what are the plans for this (if any)? I couldn't find any prior discussion, though it sounds like the discussion around cloud CI capa…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread David Li
For us it applies to S3-like systems, not only S3 itself, at least. It does make sense to limit it to some filesystems. The behavior would be opt-in at the Parquet reader level, so at the Datasets or Filesystem layer we can take care of enabling the flag for filesystems where it actually helps. I…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread Antoine Pitrou
On 18/03/2020 at 18:30, David Li wrote: >> Instead of S3, you can use the Slow streams and Slow filesystem >> implementations. It may better protect against varying external conditions. > > I think we'd want several different benchmarks - we want to ensure we > don't regress local filesystem performance…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread David Li
> Instead of S3, you can use the Slow streams and Slow filesystem > implementations. It may better protect against varying external conditions. I think we'd want several different benchmarks - we want to ensure we don't regress local filesystem performance, and we also want to measure in an actual…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread Wes McKinney
On Wed, Mar 18, 2020 at 11:42 AM Antoine Pitrou wrote: > > > On 18/03/2020 at 17:36, David Li wrote: > > Hi all, > > > > Thanks to Antoine for implementing the core read coalescing logic. > > > > We've taken a look at what else needs to be done to get this working, > > and it sounds like the following…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread Wes McKinney
hi David, Yes, this sounds right to me. I would say that we should come up with the public API for column prebuffering ASAP and then get to work on implementing it and working to maximize the throughput. - Wes On Wed, Mar 18, 2020 at 11:37 AM David Li wrote: > > Hi all, > > Thanks to Antoine for…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread Antoine Pitrou
On 18/03/2020 at 17:36, David Li wrote: > Hi all, > > Thanks to Antoine for implementing the core read coalescing logic. > > We've taken a look at what else needs to be done to get this working, > and it sounds like the following changes would be worthwhile, > independent of the rest of the optimizations…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-18 Thread David Li
Hi all, Thanks to Antoine for implementing the core read coalescing logic. We've taken a look at what else needs to be done to get this working, and it sounds like the following changes would be worthwhile, independent of the rest of the optimizations we discussed: - Add benchmarks of the current…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread David Li
Catching up on questions here... > Typically you can solve this by having enough IO concurrency at once :-) > I'm not sure having sophisticated global coordination (based on which > algorithms) would bring anything. Would you care to elaborate? We aren't proposing *sophisticated* global coordination…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou wrote: > > > On 06/02/2020 at 20:20, Wes McKinney wrote: > >> Actually, on a more high-level basis, is the goal to prefetch for > >> sequential consumption of row groups? > >> > > > > Essentially yes. One "easy" optimization is to prefetch the entire…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 20:20, Wes McKinney wrote: >> Actually, on a more high-level basis, is the goal to prefetch for >> sequential consumption of row groups? >> > > Essentially yes. One "easy" optimization is to prefetch the entire > serialized row group. This is an evolution of that idea where we…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou wrote: > > On 06/02/2020 at 19:40, Antoine Pitrou wrote: > > > > On 06/02/2020 at 19:37, Wes McKinney wrote: > >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou > wrote: > >> > >>> On 06/02/2020 at 16:26, Wes McKinney wrote: > > This seems…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:41 PM Antoine Pitrou wrote: > > On 06/02/2020 at 19:37, Wes McKinney wrote: > > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > > >> On 06/02/2020 at 16:26, Wes McKinney wrote: > >>> > >>> This seems useful, too. It becomes a question of where do you want to…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 19:40, Antoine Pitrou wrote: > > On 06/02/2020 at 19:37, Wes McKinney wrote: >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: >> >>> On 06/02/2020 at 16:26, Wes McKinney wrote: This seems useful, too. It becomes a question of where do you want to manage…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 19:37, Wes McKinney wrote: > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > >> On 06/02/2020 at 16:26, Wes McKinney wrote: >>> >>> This seems useful, too. It becomes a question of where do you want to >>> manage the cached memory segments, however you obtain them. I'm…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > On 06/02/2020 at 16:26, Wes McKinney wrote: > > > > This seems useful, too. It becomes a question of where do you want to > > manage the cached memory segments, however you obtain them. I'm > > arguing that we should not have much custom code…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 16:26, Wes McKinney wrote: > > This seems useful, too. It becomes a question of where do you want to > manage the cached memory segments, however you obtain them. I'm > arguing that we should not have much custom code in the Parquet > library to manage the prefetched segments…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 17:07, Wes McKinney wrote: > In case folks are interested in how some other systems deal with IO > management / scheduling, the comments in > > https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h > > and related files might be interesting Thanks. Th…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
In case folks are interested in how some other systems deal with IO management / scheduling, the comments in https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h and related files might be interesting On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney wrote: > > On Thu, Feb 6,…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou wrote: > > On Wed, 5 Feb 2020 15:46:15 -0600 > Wes McKinney wrote: > > > > I'll comment in more detail on some of the other items in due course, > > but I think this should be handled by an implementation of > > RandomAccessFile (that wraps a naked RandomAccessFile)…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 16:37:17 -0500 David Li wrote: > > As a separate step, prefetching/caching should also make use of a > global (or otherwise shared) IO thread pool, so that parallel reads of > different files implicitly coordinate work with each other as well. > Then, you could queue up reads o…
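
A toy illustration of the shared-pool idea quoted above: one process-wide pool that every reader submits to, so concurrent scans draw from a single concurrency budget. The module-level pool and `submit_read` are invented names; real code would also need sizing and shutdown policy:

```python
from concurrent.futures import ThreadPoolExecutor

# One process-wide IO pool shared by every reader, so parallel scans
# of different files implicitly coordinate instead of each file
# spawning its own threads.
_IO_POOL = ThreadPoolExecutor(max_workers=8)

def submit_read(fn, *args):
    """Queue an IO task on the shared pool; returns a Future."""
    return _IO_POOL.submit(fn, *args)

# Stand-in "reads" from different files share the one pool.
futures = [submit_read(lambda i=i: i * i) for i in range(4)]
assert [f.result() for f in futures] == [0, 1, 4, 9]
```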

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 15:46:15 -0600 Wes McKinney wrote: > > I'll comment in more detail on some of the other items in due course, > but I think this should be handled by an implementation of > RandomAccessFile (that wraps a naked RandomAccessFile) with some > additional methods, rather than adding…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-05 Thread Wes McKinney
On Wed, Feb 5, 2020 at 3:37 PM David Li wrote: > > Hi Antoine and Wes, > > Thanks for the feedback. Yes, we should definitely consider these as > separate features. > > I agree that it makes sense for the file API (or a derived API) to > expose a generic CacheRanges or PrebufferRanges API. It could…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-05 Thread David Li
Hi Antoine and Wes, Thanks for the feedback. Yes, we should definitely consider these as separate features. I agree that it makes sense for the file API (or a derived API) to expose a generic CacheRanges or PrebufferRanges API. It could then do coalescing and prefetching as desired based on the a…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-05 Thread Wes McKinney
I agree with separating the problem into its constituent concerns to make sure that we are developing appropriate abstractions. Speaking specifically about the Parquet codebase, the way that we access a particular ColumnChunk in a row group is fairly simplistic. See the ReaderProperties::GetStream…

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-05 Thread Antoine Pitrou
Hi David, I think we should discuss this as individual features. > Read Coalescing: from Parquet metadata, we know exactly > which byte ranges of > a file will be read, and can “cheat” in the S3 IO layer by fetching them in advance It seems there are two things here: coalescing individual reads, …
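
The first of those two things, coalescing individual reads, can be sketched minimally: merge sorted byte ranges whose gap is small enough that one larger request beats two round trips. The `hole_limit` parameter name is an assumption for illustration; a production implementation would also cap the size of a merged range:

```python
def coalesce_ranges(ranges, hole_limit=8192):
    """Merge (offset, length) ranges whose gap is <= hole_limit, so
    nearby column-chunk reads become one larger request."""
    if not ranges:
        return []
    ranges = sorted(ranges)
    out = [ranges[0]]
    for off, length in ranges[1:]:
        last_off, last_len = out[-1]
        if off - (last_off + last_len) <= hole_limit:
            # Extend the previous range to cover the hole and this range.
            out[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            out.append((off, length))
    return out

# Two nearby ranges merge (reading the 50-byte hole is cheaper than a
# second request); the distant one stays separate.
assert coalesce_ranges([(0, 100), (150, 100), (100_000, 10)]) == \
    [(0, 250), (100_000, 10)]
```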