hi all,

There have been many discussions in passing on various issues and JIRA
tickets over the last months and years about how to manage
dictionary-encoded columnar arrays in-memory in C++. Here's a list of
some problems we have encountered:

* Dictionaries that may differ from one record batch to another, but
semantically represent parts of the same dataset. For example, each
row group in the Parquet format may have different dictionaries, and
decoding from the encoded to the "dense" representation may be
undesirable

* Support for dictionary "deltas" in the IPC protocol, using the
isDelta flag on DictionaryBatch [1]. It's conceivable we might want to
allow dictionaries to change altogether in the IPC protocol (where a
dictionary id appears again -- a replacement -- but isDelta = false)

* Receiving a record batch schema without the dictionaries attached
(e.g. in Arrow Flight), see also experimental patch [2]

* Encoded datasets may be produced in a distributed system where
"reconciling" the dictionary is not computationally feasible or
desirable (particularly if the encoded data is just going to be
aggregated)

The reason that these are "problems" has to do with the way that we
represent dictionary encoded data in C++. We have created a
"synthetic" DictionaryType object that is used for Array/Column types
or for an entry in arrow::Schema. The DictionaryType wraps the index
type (signed integer type) and the dictionary itself. So from Python
we have

>>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
>>> t
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
>>> t.dictionary
<pyarrow.lib.StringArray object at 0x7eff4847fd68>
[
  "a",
  "b",
  "c",
  "d"
]
>>> t.index_type
DataType(int8)

This is useful in languages like Python because we have the notion of
Categorical types, which are a combination of a static dictionary and
a contiguous array of indices.

It creates problems when we have changing dictionaries, because the
"schema" under this in-memory construction may change from record
batch to record batch. This means that types are deemed "unequal" by
many of the code paths we've written.
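
To make the problem concrete, here is a minimal sketch against the
current C++ API (I'm quoting the factory signatures from memory, so
treat the details as approximate): two batches of the same logical
column whose dictionaries differ produce schemas that compare unequal,
simply because the dictionary values are embedded in the type.

#include <arrow/api.h>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

arrow::Status SchemaEqualityDemo() {
  // Two dictionaries for what is logically the same string column
  std::shared_ptr<arrow::Array> dict1, dict2;
  arrow::StringBuilder b1, b2;
  ARROW_RETURN_NOT_OK(
      b1.AppendValues(std::vector<std::string>{"a", "b", "c"}));
  ARROW_RETURN_NOT_OK(b1.Finish(&dict1));
  ARROW_RETURN_NOT_OK(
      b2.AppendValues(std::vector<std::string>{"a", "b", "c", "d"}));
  ARROW_RETURN_NOT_OK(b2.Finish(&dict2));

  // The dictionary values live inside the DataType itself
  auto s1 = arrow::schema(
      {arrow::field("f0", arrow::dictionary(arrow::int8(), dict1))});
  auto s2 = arrow::schema(
      {arrow::field("f0", arrow::dictionary(arrow::int8(), dict2))});

  // Prints 0: the schemas are deemed unequal solely because the
  // embedded dictionaries differ
  std::cout << s1->Equals(*s2) << std::endl;
  return arrow::Status::OK();
}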

To consider solutions to this problem, I want to first point out that
the way we are dealing with dictionary-encoded data in memory is a
purely semantic construct for C++ and the binding languages.
"Dictionary" is not a data type as all according to the Arrow IPC
protocol -- it is a method is transferring encoded / compressed data,
and the handling thereof is left to the implementations. There are
benefits to the method we are using now, in particular it makes
dynamic dispatch (including the visitor pattern, and virtual
functions) based on whether something is encoded or not simple. It
also leads to simple round trips of Categorical types from libraries
like pandas.
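
To make that dispatch concrete, here is a minimal sketch of what I
mean; the ProcessEncoded / ProcessDense helpers are hypothetical
stand-ins, but the type id check and the accessors are what we have
today:

#include <arrow/api.h>
#include <memory>

// Hypothetical downstream kernels, stubbed out for illustration
arrow::Status ProcessEncoded(const arrow::Array& indices,
                             const arrow::Array& dictionary) {
  return arrow::Status::OK();
}

arrow::Status ProcessDense(const arrow::Array& values) {
  return arrow::Status::OK();
}

// Encoded data is recognized purely from the type id, and both the
// indices and the dictionary are reachable from the array itself
arrow::Status Process(const std::shared_ptr<arrow::Array>& arr) {
  if (arr->type_id() == arrow::Type::DICTIONARY) {
    const auto& dict_arr = static_cast<const arrow::DictionaryArray&>(*arr);
    return ProcessEncoded(*dict_arr.indices(), *dict_arr.dictionary());
  }
  return ProcessDense(*arr);
}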

Here is my proposal to reconcile these issues in C++:

* Add a new "synthetic" data type called "variable dictionary" to be
used alongside the existing "static dictionary" type. An instance of
VariableDictionaryType (name TBD) will not know what the dictionary
is, only the data type of the dictionary (e.g. utf8()) and the index
type (e.g. int32())

* Define a common abstract API for instances of static vs. variable
dictionary arrays. Mainly this means making
DictionaryArray::dictionary [3] virtual

* The _actual_ dictionary values for a particular Array must be
stored somewhere and their lifetime managed. I propose to put these
as a single entry in ArrayData::child_data [4]. An alternative to
this would be to modify ArrayData to have a dictionary field that
would be unused except for encoded datasets
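
To show the shape of this, here is a rough, hypothetical sketch --
none of these classes exist in the library and the names are
placeholders -- of the two pieces: a type that knows only the value
and index types, and a common accessor that both flavors of encoded
array would implement:

#include <arrow/api.h>
#include <memory>
#include <utility>

// Placeholder: knows the index type and the dictionary *value* type,
// but not the dictionary values themselves
class VariableDictionaryType {
 public:
  VariableDictionaryType(std::shared_ptr<arrow::DataType> index_type,
                         std::shared_ptr<arrow::DataType> value_type)
      : index_type_(std::move(index_type)),
        value_type_(std::move(value_type)) {}

  const std::shared_ptr<arrow::DataType>& index_type() const {
    return index_type_;
  }
  const std::shared_ptr<arrow::DataType>& value_type() const {
    return value_type_;
  }

 private:
  std::shared_ptr<arrow::DataType> index_type_;
  std::shared_ptr<arrow::DataType> value_type_;
};

// Placeholder: common accessor for static and variable dictionary
// arrays. For the variable flavor, dictionary() would return values
// stored with the array itself (e.g. as the single entry in
// ArrayData::child_data) rather than values embedded in the type.
class EncodedArrayInterface {
 public:
  virtual ~EncodedArrayInterface() = default;
  virtual std::shared_ptr<arrow::Array> indices() const = 0;
  virtual std::shared_ptr<arrow::Array> dictionary() const = 0;
};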

This proposal does create some ongoing implementation and maintenance
burden, but to that I would make these points:

* Many algorithms will dispatch from one type to the other (probably
static dispatch to the variable path), so there will not be a need to
implement them multiple times in most cases

* In some algorithms, we may observe a stream of dictionary-encoded
arrays, and we need only obtain the current dictionary along with
knowledge of whether it is the same as previous dictionaries. In hash
aggregations and other analytics, I think we need to implement by
default under the assumption of dynamic/variable dictionaries (a
sketch of this pattern follows below)
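
As an illustration of that second point, here is a hedged sketch of
consuming a stream of encoded arrays and redoing dictionary-dependent
work only when the dictionary actually changes; the OnNewDictionary /
ConsumeIndices hooks are hypothetical stand-ins for real kernel logic:

#include <arrow/api.h>
#include <memory>

class EncodedStreamConsumer {
 public:
  arrow::Status Consume(const arrow::DictionaryArray& arr) {
    std::shared_ptr<arrow::Array> dict = arr.dictionary();
    // Cheap check first (same Array object), then value equality
    const bool same = current_dict_ != nullptr &&
                      (dict.get() == current_dict_.get() ||
                       dict->Equals(*current_dict_));
    if (!same) {
      // e.g. rebuild a hash table keyed on the new dictionary values
      ARROW_RETURN_NOT_OK(OnNewDictionary(*dict));
      current_dict_ = dict;
    }
    // Aggregate using only the (small) integer indices
    return ConsumeIndices(*arr.indices());
  }

 private:
  // Hypothetical hooks standing in for real kernel logic
  arrow::Status OnNewDictionary(const arrow::Array& dict) {
    return arrow::Status::OK();
  }
  arrow::Status ConsumeIndices(const arrow::Array& indices) {
    return arrow::Status::OK();
  }

  std::shared_ptr<arrow::Array> current_dict_;
};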

I haven't conceived of any other ideas (after much contemplation) for
how to algebraically accommodate these use cases in our object model,
so I am interested in the opinions of others. As a first use case for
this, I would personally be looking to address reads of encoded data
from the Parquet format without an intermediate pass through the dense
representation (which can be slow and wasteful for heavily compressed
string data).

Thanks,
Wes

[1]: 
https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
[2]: https://github.com/apache/arrow/pull/4067
[3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
[4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208
