Hi Micah,

Hope everyone is staying safe!

> On Mar 16, 2020, at 9:41 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I feel a little uncomfortable in the fact that there isn't a more clearly
> defined dividing line for what belongs in Arrow and what doesn't. I suppose
> this is what discussions like these are about :)
>
> In my mind Arrow has two primary goals:
> 1. Wide accessibility
> 2. Speed, primarily with the assumption that CPU time is the constraining
> resource. Some of the technologies in the Arrow ecosystem (e.g. Feather,
> Flight) have blurred this line a little bit. In some cases these technologies
> are dominated by other constraints as seen by Wes's compression proposal [1].

If we only look at the narrow scope of iterating over a single Arrow array in memory, perhaps CPU is the dominant constraint (though even there, I would expect L1/L2 cache to be fairly important). Once we widen the scope, for example to a large volume of data, or to loading and translating data from Parquet and disk, the factors become much more complex. A fairly substantial amount of CPU is needed for translating from Parquet, and main memory bandwidth becomes a factor. Thus, it seems that speed and the constraining factors vary widely by application, and having more encodings might extend the use of Arrow to wider scopes :)

> Given these points, adding complexity and constant factors, at least for an
> initial implementation, are something that needs to be looked at very
> carefully. I would likely change my mind if there was a demonstration that
> the complexity adds substantial value across a variety of data sets/workloads
> beyond simpler versions.

Perhaps we can have a discussion about what that demonstration would entail?

Also, would the Arrow community be open to some kind of “third party encoding” capability?
It would facilitate experimentation by others in real-world scenarios, and if those use cases and workloads prove to be useful to others, perhaps the community could then consider adopting them more widely?

Cheers,
Evan