On Mon, Jan 28, 2019 at 12:05 AM Ravindra Pindikura <ravin...@dremio.com> wrote: > > > > > On Jan 28, 2019, at 11:22 AM, Wes McKinney <wesmck...@gmail.com> wrote: > > > > I was having a discussion recently about Arrow and the topic of > > server-side filtering vs. client-side filtering came up. > > > > The basic problem is this: > > > > If you have a RecordBatch that you wish to filter out some of the > > "rows", one way to track this in-memory is to create a separate array > > of true/false values instead of forcing a materialization of a > > filtered RecordBatch. > > > > So you might have a record batch with 2 fields and 4 rows > > > > a: [1, 2, 3, 4] > > b: ['foo', 'bar', 'baz', 'qux'] > > > > and then a filter > > > > is_selected: [true, true, false, true] > > In Gandiva too, we use a slight variant of this : a selection vector that has > indices of the selected records. > > > > > This can be easily handled as an application-level concern. Creating a > > bitmask is generally cheap relative to materializing the filtered > > version, and some operators may support "pushing down" such filters > > (e.g. aggregations may accept a selection mask to exclude "off" > > values). I myself implemented such a scheme in the past for a query > > engine I built in the 2013-2014 time frame and it yielded material > > performance improvements in some cases. > > > > One question is what you should do when you want to put the data on > > the wire, e.g. via RPC / Flight or IPC. Two options > > > > * Pass the complete RecordBatch, plus the filter as a "special" field > > and attach some metadata so that you know that filter field is > > "special" > > > > * Filter the RecordBatch before sending, send only the selected rows > > > > The first option can of course be implemented as an application-level > > detail, much as we handle the serialization of pandas row indexes > > right now (where we have custom pandas metadata and "special" > > fields/columns for the index arrays). But it could be a common enough > > use case to merit a more well-defined standardized approach. > > > > > I'm not sure what the answer is, but I wanted to describe the problem > > as I see it and see if anyone has any thoughts about it. > > > > I'm aware that Dremio is a user of "selection vectors" (selected > > indices, instead of boolean true/false values, so we would have [0, 1, > > 3] in the above case), so the similar discussion may apply to passing > > selection vectors on the wire. > > I’m not sure if it’s worth passing the unfiltered record batch on the wire > (the cost of additional network bytes will likely outweigh the > materiailzation cost). But, it’s definitely beneficial between operators (eg. > Filter and projector).
I would say that it depends. A boolean mask could be used to implement DELETE operations in an in-memory dataset. If only a small fraction of the entries are deleted, then you may want to send the unfiltered batch. > > Also, having an efficient way to materialize the filtered version will be > useful (i.e selection vector remover). > > > > > - Wes >