hi all,

There have been many discussions in passing on various issues and JIRA tickets over the last months and years about how to manage dictionary-encoded columnar arrays in memory in C++. Here's a list of some problems we have encountered:
* Dictionaries that may differ from one record batch to another, but semantically represent parts of the same dataset. For example, each row group in the Parquet format may have different dictionaries, and decoding from encoded to "dense" representation may be undesirable

* Support for dictionary "deltas" in the IPC protocol, using the isDelta flag on DictionaryBatch [1]. It's conceivable we might want to allow dictionaries to change altogether in the IPC protocol (where a dictionary id appears again -- a replacement -- but isDelta = false)

* Receiving a record batch schema without the dictionaries attached (e.g. in Arrow Flight); see also the experimental patch [2]

* Encoded datasets may be produced in a distributed system where "reconciling" the dictionary is not computationally feasible or desirable (particularly if the encoded data is just going to be aggregated)

The reason that these are "problems" has to do with the way that we represent dictionary-encoded data in C++. We have created a "synthetic" DictionaryType object that is used for Array/Column types or for an entry in arrow::Schema. The DictionaryType wraps the index type (a signed integer type) and the dictionary itself. So from Python we have

>>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
>>> t
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
>>> t.dictionary
<pyarrow.lib.StringArray object at 0x7eff4847fd68>
[
  "a",
  "b",
  "c",
  "d"
]
>>> t.index_type
DataType(int8)

This is useful in languages like Python because we have the notion of Categorical types, which are a combination of a static dictionary and a contiguous array of indices. It creates problems when we have changing dictionaries, because the "schema" under this in-memory construction may change from record batch to record batch. This means that types are deemed "unequal" according to many code paths we've written.

To consider solutions to this problem, I want to first point out that the way we are dealing with dictionary-encoded data in memory is a purely semantic construct for C++ and the binding languages. "Dictionary" is not a data type at all according to the Arrow IPC protocol -- it is a method of transferring encoded / compressed data, and the handling thereof is left to the implementations. There are benefits to the method we are using now; in particular, it makes dynamic dispatch (including the visitor pattern and virtual functions) based on whether something is encoded or not simple. It also leads to simple round trips of Categorical types from libraries like pandas.

Here is my proposal to reconcile these issues in C++:

* Add a new "synthetic" data type called "variable dictionary" to be used alongside the existing "static dictionary" type. An instance of VariableDictionaryType (name TBD) will not know what the dictionary is, only the data type of the dictionary (e.g. utf8()) and the index type (e.g. int32())

* Define a common abstract API for instances of static vs. variable dictionary arrays. Mainly this means making DictionaryArray::dictionary [3] virtual

* The _actual_ dictionary values for a particular Array must be stored somewhere and have their lifetime managed. I propose to put these as a single entry in ArrayData::child_data [4]. An alternative to this would be to modify ArrayData to have a dictionary field that would be unused except for encoded datasets
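To make the class relationships concrete, here is a minimal, simplified sketch of what this could look like. This is not real Arrow code: the names VariableDictionaryType and DictionaryArrayBase, and the cut-down DataType/Array stand-ins, are hypothetical placeholders for discussion; a real implementation would of course build on the existing arrow::DataType, arrow::Array, and arrow::ArrayData classes.

// Simplified stand-ins for arrow::DataType and arrow::Array, just enough
// to show the shape of the proposal
#include <memory>
#include <string>
#include <utility>

struct DataType { std::string name; };
struct Array { std::shared_ptr<DataType> type; };
using ArrayPtr = std::shared_ptr<Array>;

// Today's "static" dictionary type: the dictionary values are part of the type
struct StaticDictionaryType : DataType {
  std::shared_ptr<DataType> index_type;  // e.g. int8
  ArrayPtr dictionary;                   // e.g. ["a", "b", "c", "d"]
};

// Proposed "variable" dictionary type: only the index and value types are
// known; the actual dictionary travels with each array
struct VariableDictionaryType : DataType {
  std::shared_ptr<DataType> index_type;  // e.g. int32
  std::shared_ptr<DataType> value_type;  // e.g. utf8
};

// Common abstract API: dictionary() is virtual, so algorithms can ask
// "what is the dictionary for *this* array?" without caring where it lives
class DictionaryArrayBase {
 public:
  virtual ~DictionaryArrayBase() = default;
  virtual ArrayPtr dictionary() const = 0;
};

class StaticDictionaryArray : public DictionaryArrayBase {
 public:
  explicit StaticDictionaryArray(std::shared_ptr<StaticDictionaryType> type)
      : type_(std::move(type)) {}
  // The dictionary is shared by every array of this type
  ArrayPtr dictionary() const override { return type_->dictionary; }
 private:
  std::shared_ptr<StaticDictionaryType> type_;
};

class VariableDictionaryArray : public DictionaryArrayBase {
 public:
  explicit VariableDictionaryArray(ArrayPtr dictionary)
      : dictionary_(std::move(dictionary)) {}
  // The dictionary is carried per array -- e.g. as the single entry in
  // ArrayData::child_data -- so it may differ from batch to batch
  ArrayPtr dictionary() const override { return dictionary_; }
 private:
  ArrayPtr dictionary_;
};

A hash aggregation kernel, for example, could then accept a DictionaryArrayBase and call dictionary() on each incoming batch, checking whether the dictionary has changed since the previous batch before reusing any per-dictionary state.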
This proposal does create some ongoing implementation and maintenance burden, but to that I would make these points:

* Many algorithms will dispatch from one type to the other (probably static dispatching to the variable path), so there will not be a need to implement them multiple times in most cases

* In some algorithms, we may observe a stream of dictionary-encoded arrays, and we need only obtain the current dictionary as well as the knowledge of whether it is the same as previous dictionaries. In hash aggregations and other analytics I think we need to implement by default under the assumption of dynamic/variable dictionaries

I haven't conceived of any other ideas (after much contemplation) for how to algebraically accommodate these use cases in our object model, so I'm interested in the opinions of others. As a first use case for this I would personally be looking to address reads of encoded data from the Parquet format without an intermediate pass through the dense representation (which can be slow and wasteful for heavily compressed string data).

Thanks,
Wes

[1]: https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
[2]: https://github.com/apache/arrow/pull/4067
[3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
[4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208