I was having a discussion recently about Arrow and the topic of server-side filtering vs. client-side filtering came up.
The basic problem is this: if you have a RecordBatch from which you wish to filter out some of the "rows", one way to track this in memory is to create a separate array of true/false values instead of forcing a materialization of a filtered RecordBatch.

So you might have a record batch with 2 fields and 4 rows

a: [1, 2, 3, 4]
b: ['foo', 'bar', 'baz', 'qux']

and then a filter

is_selected: [true, true, false, true]

This can be easily handled as an application-level concern. Creating a bitmask is generally cheap relative to materializing the filtered version, and some operators may support "pushing down" such filters (e.g. aggregations may accept a selection mask to exclude "off" values). I myself implemented such a scheme in the past for a query engine I built in the 2013-2014 time frame, and it yielded material performance improvements in some cases.

One question is what you should do when you want to put the data on the wire, e.g. via RPC / Flight or IPC. Two options:

* Pass the complete RecordBatch, plus the filter as a "special" field, and attach some metadata so that the receiver knows that the filter field is "special"
* Filter the RecordBatch before sending, and send only the selected rows

The first option can of course be implemented as an application-level detail, much as we handle the serialization of pandas row indexes right now (where we have custom pandas metadata and "special" fields/columns for the index arrays). But it could be a common enough use case to merit a more well-defined, standardized approach.

I'm not sure what the answer is, but I wanted to describe the problem as I see it and see if anyone has any thoughts about it.

I'm aware that Dremio is a user of "selection vectors" (selected indices instead of boolean true/false values, so we would have [0, 1, 3] in the above case), so a similar discussion may apply to passing selection vectors on the wire.

- Wes
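
For illustration, here is a minimal sketch in plain Python of the two representations discussed above, using the same example data. It deliberately uses ordinary lists rather than any Arrow API, and the helper names (mask_to_indices, take, masked_sum) are hypothetical, chosen only for this example. It shows that a boolean mask and a selection vector carry the same information, and how an aggregation might consume the mask directly without materializing the filtered rows.

```python
# Example data from the message: a record batch with 2 fields and 4 rows,
# plus a boolean filter. Plain lists stand in for Arrow arrays here.
a = [1, 2, 3, 4]
b = ["foo", "bar", "baz", "qux"]
is_selected = [True, True, False, True]


def mask_to_indices(mask):
    """Convert a boolean mask into a selection vector of selected row indices."""
    return [i for i, keep in enumerate(mask) if keep]


def indices_to_mask(indices, length):
    """Convert a selection vector back into a boolean mask of the given length."""
    mask = [False] * length
    for i in indices:
        mask[i] = True
    return mask


def take(column, indices):
    """Materialize only the selected rows of one column."""
    return [column[i] for i in indices]


def masked_sum(column, mask):
    """Aggregate with the filter 'pushed down': skip rows whose mask is False."""
    return sum(value for value, keep in zip(column, mask) if keep)


selection_vector = mask_to_indices(is_selected)

# Option 2 from the message: materialize the filtered batch before sending.
filtered_a = take(a, selection_vector)
filtered_b = take(b, selection_vector)

# Pushed-down alternative: aggregate over the full column using the mask.
total = masked_sum(a, is_selected)
```

In this sketch the mask costs one value per row regardless of how many rows are selected, while the selection vector costs one index per selected row, so which one is smaller on the wire would depend on the selectivity of the filter.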