> On Jan 28, 2019, at 11:47 AM, Wes McKinney <wesmck...@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 12:05 AM Ravindra Pindikura <ravin...@dremio.com > <mailto:ravin...@dremio.com>> wrote: >> >> >> >>> On Jan 28, 2019, at 11:22 AM, Wes McKinney <wesmck...@gmail.com> wrote: >>> >>> I was having a discussion recently about Arrow and the topic of >>> server-side filtering vs. client-side filtering came up. >>> >>> The basic problem is this: >>> >>> If you have a RecordBatch that you wish to filter out some of the >>> "rows", one way to track this in-memory is to create a separate array >>> of true/false values instead of forcing a materialization of a >>> filtered RecordBatch. >>> >>> So you might have a record batch with 2 fields and 4 rows >>> >>> a: [1, 2, 3, 4] >>> b: ['foo', 'bar', 'baz', 'qux'] >>> >>> and then a filter >>> >>> is_selected: [true, true, false, true] >> >> In Gandiva too, we use a slight variant of this : a selection vector that >> has indices of the selected records. >> >>> >>> This can be easily handled as an application-level concern. Creating a >>> bitmask is generally cheap relative to materializing the filtered >>> version, and some operators may support "pushing down" such filters >>> (e.g. aggregations may accept a selection mask to exclude "off" >>> values). I myself implemented such a scheme in the past for a query >>> engine I built in the 2013-2014 time frame and it yielded material >>> performance improvements in some cases. >>> >>> One question is what you should do when you want to put the data on >>> the wire, e.g. via RPC / Flight or IPC. Two options >>> >>> * Pass the complete RecordBatch, plus the filter as a "special" field >>> and attach some metadata so that you know that filter field is >>> "special" >>> >>> * Filter the RecordBatch before sending, send only the selected rows >>> >>> The first option can of course be implemented as an application-level >>> detail, much as we handle the serialization of pandas row indexes >>> right now (where we have custom pandas metadata and "special" >>> fields/columns for the index arrays). But it could be a common enough >>> use case to merit a more well-defined standardized approach. >> >>> >>> I'm not sure what the answer is, but I wanted to describe the problem >>> as I see it and see if anyone has any thoughts about it. >>> >>> I'm aware that Dremio is a user of "selection vectors" (selected >>> indices, instead of boolean true/false values, so we would have [0, 1, >>> 3] in the above case), so the similar discussion may apply to passing >>> selection vectors on the wire. >> >> I’m not sure if it’s worth passing the unfiltered record batch on the wire >> (the cost of additional network bytes will likely outweigh the >> materiailzation cost). But, it’s definitely beneficial between operators >> (eg. Filter and projector). > > I would say that it depends. A boolean mask could be used to implement > DELETE operations in an in-memory dataset. If only a small fraction of > the entries are deleted, then you may want to send the unfiltered > batch.
Yeah, agree. > >> >> Also, having an efficient way to materialize the filtered version will be >> useful (i.e selection vector remover). >> >>> >>> - Wes