Re: [Format] Passing selection masks with Arrow record batches

Ravindra Pindikura Sun, 27 Jan 2019 22:23:22 -0800


> On Jan 28, 2019, at 11:47 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> On Mon, Jan 28, 2019 at 12:05 AM Ravindra Pindikura <ravin...@dremio.com 
> <mailto:ravin...@dremio.com>> wrote:
>> 
>> 
>> 
>>> On Jan 28, 2019, at 11:22 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> 
>>> I was having a discussion recently about Arrow and the topic of
>>> server-side filtering vs. client-side filtering came up.
>>> 
>>> The basic problem is this:
>>> 
>>> If you have a RecordBatch that you wish to filter out some of the
>>> "rows", one way to track this in-memory is to create a separate array
>>> of true/false values instead of forcing a materialization of a
>>> filtered RecordBatch.
>>> 
>>> So you might have a record batch with 2 fields and 4 rows
>>> 
>>> a: [1, 2, 3, 4]
>>> b: ['foo', 'bar', 'baz', 'qux']
>>> 
>>> and then a filter
>>> 
>>> is_selected: [true, true, false, true]
>> 
>> In Gandiva too, we use a slight variant of this : a selection vector that 
>> has indices of the selected records.
>> 
>>> 
>>> This can be easily handled as an application-level concern. Creating a
>>> bitmask is generally cheap relative to materializing the filtered
>>> version, and some operators may support "pushing down" such filters
>>> (e.g. aggregations may accept a selection mask to exclude "off"
>>> values). I myself implemented such a scheme in the past for a query
>>> engine I built in the 2013-2014 time frame and it yielded material
>>> performance improvements in some cases.
>>> 
>>> One question is what you should do when you want to put the data on
>>> the wire, e.g. via RPC / Flight or IPC. Two options
>>> 
>>> * Pass the complete RecordBatch, plus the filter as a "special" field
>>> and attach some metadata so that you know that filter field is
>>> "special"
>>> 
>>> * Filter the RecordBatch before sending, send only the selected rows
>>> 
>>> The first option can of course be implemented as an application-level
>>> detail, much as we handle the serialization of pandas row indexes
>>> right now (where we have custom pandas metadata and "special"
>>> fields/columns for the index arrays). But it could be a common enough
>>> use case to merit a more well-defined standardized approach.
>> 
>>> 
>>> I'm not sure what the answer is, but I wanted to describe the problem
>>> as I see it and see if anyone has any thoughts about it.
>>> 
>>> I'm aware that Dremio is a user of "selection vectors" (selected
>>> indices, instead of boolean true/false values, so we would have [0, 1,
>>> 3] in the above case), so the similar discussion may apply to passing
>>> selection vectors on the wire.
>> 
>> I’m not sure if it’s worth passing the unfiltered record batch on the wire 
>> (the cost of additional network bytes will likely outweigh the 
>> materiailzation cost). But, it’s definitely beneficial between operators 
>> (eg. Filter and projector).
> 
> I would say that it depends. A boolean mask could be used to implement
> DELETE operations in an in-memory dataset. If only a small fraction of
> the entries are deleted, then you may want to send the unfiltered
> batch.


Yeah, agree.

> 
>> 
>> Also, having an efficient way to materialize the filtered version will be 
>> useful (i.e selection vector remover).
>> 
>>> 
>>> - Wes

Re: [Format] Passing selection masks with Arrow record batches

Reply via email to