On Mon, Jan 28, 2019 at 12:05 AM Ravindra Pindikura <ravin...@dremio.com> wrote:
>
>
>
> > On Jan 28, 2019, at 11:22 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > I was having a discussion recently about Arrow and the topic of
> > server-side filtering vs. client-side filtering came up.
> >
> > The basic problem is this:
> >
> > If you have a RecordBatch from which you wish to filter out some of
> > the "rows", one way to track this in memory is to create a separate
> > array of true/false values instead of forcing a materialization of a
> > filtered RecordBatch.
> >
> > So you might have a record batch with 2 fields and 4 rows
> >
> > a: [1, 2, 3, 4]
> > b: ['foo', 'bar', 'baz', 'qux']
> >
> > and then a filter
> >
> > is_selected: [true, true, false, true]
>
> In Gandiva too, we use a slight variant of this: a selection vector that
> holds the indices of the selected records.
>
> >
> > This can be easily handled as an application-level concern. Creating a
> > bitmask is generally cheap relative to materializing the filtered
> > version, and some operators may support "pushing down" such filters
> > (e.g. aggregations may accept a selection mask to exclude "off"
> > values). I myself implemented such a scheme in the past for a query
> > engine I built in the 2013-2014 time frame and it yielded material
> > performance improvements in some cases.
> >
> > One question is what you should do when you want to put the data on
> > the wire, e.g. via RPC / Flight or IPC. There are two options:
> >
> > * Pass the complete RecordBatch, plus the filter as a "special" field
> > and attach some metadata so that you know that filter field is
> > "special"
> >
> > * Filter the RecordBatch before sending, send only the selected rows
> >
> > The first option can of course be implemented as an application-level
> > detail, much as we handle the serialization of pandas row indexes
> > right now (where we have custom pandas metadata and "special"
> > fields/columns for the index arrays). But it could be a common enough
> > use case to merit a more well-defined standardized approach.
>
> >
> > I'm not sure what the answer is, but I wanted to describe the problem
> > as I see it and see if anyone has any thoughts about it.
> >
> > I'm aware that Dremio is a user of "selection vectors" (selected
> > indices, instead of boolean true/false values, so we would have [0, 1,
> > 3] in the above case), so a similar discussion may apply to passing
> > selection vectors on the wire.
>
> I’m not sure if it’s worth passing the unfiltered record batch on the wire
> (the cost of the additional network bytes will likely outweigh the
> materialization cost). But it’s definitely beneficial between operators
> (e.g. filter and projector).

I would say that it depends. A boolean mask could be used to implement
DELETE operations in an in-memory dataset. If only a small fraction of
the entries are deleted, then you may want to send the unfiltered
batch.
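
To make the trade-off concrete, here is a minimal sketch in plain Python.
Lists stand in for Arrow arrays, and the helper names and the byte-count
model are assumptions made up for illustration (no metadata, padding, or
variable-width data is accounted for); none of this is an Arrow API.

```python
# Illustrative sketch only: plain Python lists stand in for Arrow arrays,
# and every function name here is hypothetical.

def mask_to_selection_vector(mask):
    """Convert a boolean mask into the indices of the selected rows."""
    return [i for i, keep in enumerate(mask) if keep]


def selection_vector_to_mask(indices, num_rows):
    """Convert selected row indices back into a boolean mask."""
    mask = [False] * num_rows
    for i in indices:
        mask[i] = True
    return mask


def wire_bytes(num_rows, num_selected, bytes_per_row):
    """Rough payload estimate under an assumed fixed-width cost model.

    Returns (unfiltered batch plus bitmask, filtered batch) in bytes.
    """
    mask_bytes = (num_rows + 7) // 8  # one bit per row, rounded up
    unfiltered_plus_mask = num_rows * bytes_per_row + mask_bytes
    filtered = num_selected * bytes_per_row
    return unfiltered_plus_mask, filtered


is_selected = [True, True, False, True]
selection = mask_to_selection_vector(is_selected)  # [0, 1, 3]

# DELETE-style case: 1000 rows of 16 bytes each, 10 rows deleted.
with_mask, filtered_only = wire_bytes(1000, 990, 16)
```

Under this model the DELETE-style case costs 16125 bytes unfiltered versus
15840 bytes filtered, so skipping the materialization step adds under 2% to
the payload. With a highly selective filter the comparison flips.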

>
> Also, having an efficient way to materialize the filtered version will be
> useful (i.e. a selection vector remover).
>
> >
> > - Wes
>
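
For reference, the two ideas from the thread (materializing the selected
rows, and letting an operator consume the mask directly) can be sketched as
follows. This is plain Python over a dict of lists, not Gandiva or Arrow
code, and the names `materialize` and `masked_sum` are hypothetical.

```python
# Illustrative sketch only: a dict of equal-length lists stands in for a
# RecordBatch; function names are hypothetical.

def materialize(columns, selection):
    """Copy out only the rows named by the selection vector."""
    return {
        name: [values[i] for i in selection]
        for name, values in columns.items()
    }


def masked_sum(values, mask):
    """Aggregate with the filter pushed down: skip rows whose mask is off."""
    return sum(v for v, keep in zip(values, mask) if keep)


batch = {
    "a": [1, 2, 3, 4],
    "b": ["foo", "bar", "baz", "qux"],
}

filtered = materialize(batch, [0, 1, 3])
total = masked_sum(batch["a"], [True, True, False, True])
```

`materialize` copies every selected value, while `masked_sum` reads the
original column in place, which is why deferring materialization between
operators can pay off.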
