hi Hatem,

Thanks for commenting.
I am not sure your solution will work reliably, because code is written against arrow::DictionaryType with the presumption that the dictionary is known and static, and can be obtained by invoking DictionaryType::dictionary. In the variable dictionary case, the dictionary is instead a property of an Array instance, so it is only possible to know the dictionary if you have a realization of the data.

You say this would "allow carrying around the currently known dictionary entries for 'variable dictionaries' using the existing accessor" -- there are two cases to consider:

* What happens when the dictionary gets bigger?
* What happens when the dictionary entries change order when observing the next piece of data, but the schema stays the same?

In such cases, what is the API for obtaining the dictionary? If the dictionary has changed, then the prior state of the [immutable] DataType instance is no longer valid.

To give another motivating example for variable dictionaries -- Antoine has developed a multi-threaded CSV reader [1]. The current requirement of static dictionaries makes parallel conversions that need to ultimately yield dictionary-encoded string columns both awkward and computationally costly. If the CSV conversion code is allowed to yield a different dictionary for each converted file chunk, then things are much simpler.
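To make the schema-equality problem concrete, here is a minimal pyarrow session using the same pa.dictionary factory shown in my earlier message quoted below (the specific values and chunking are invented for illustration; the exact repr may vary by version). Two chunks of one logical string column whose observed dictionaries differ yield types that do not compare equal:

>>> import pyarrow as pa
>>> # dictionary observed while converting the first chunk
>>> t1 = pa.dictionary(pa.int8(), pa.array(['a', 'b']))
>>> # the next chunk has seen a new entry, 'c'
>>> t2 = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c']))
>>> t1.equals(t2)
False

So any code path that requires schemas to match (concatenating record batches into a Table, for example) treats the two chunks as having different types, even though they are semantically the same column.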
- Wes

[1]: https://github.com/apache/arrow/tree/master/cpp/src/arrow/csv

On Tue, Apr 30, 2019 at 12:05 PM Hatem Helal <hhe...@mathworks.com> wrote:
>
> Hi Wes,
>
> Thanks for the detailed writeup, and I think this is an important problem to solve. I spent some time thinking about this when working on ARROW-3769 and came to a similar conclusion that the current dictionary type was limiting when doing partial reads of Parquet files.
>
> I'm not sure if this makes sense, but I wondered if you considered expanding the DictionaryType to incorporate the variable or static states? I think modelling these different states using the existing data type might be simpler than introducing another data type. The implementation could be a simple flag "is_static" passed as part of the constructor, and this would allow carrying around the currently known dictionary entries for "variable dictionaries" using the existing accessor. I'd imagine that variable dictionaries could compare equal so long as the data types of the index and dictionary are equal.
>
> There are obvious backwards compatibility implications of such a change, but I don't know if Arrow even makes such a guarantee. This might be a dangerous change to make if clients have already written code that assumes that DictionaryTypes are always static, but I see some simplicity in extending the meaning of the existing data type.
>
> I'd be happy to take a stab at a draft PR for this if this idea sounds promising and/or needs more fleshing out to form an opinion.
>
> Thanks,
>
> Hatem
>
>
> On 4/29/19, 7:10 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
>
> hi all,
>
> There have been many discussions in passing on various issues and JIRA tickets over the last months and years about how to manage dictionary-encoded columnar arrays in-memory in C++. Here's a list of some problems we have encountered:
>
> * Dictionaries that may differ from one record batch to another, but represent semantically parts of the same dataset. For example, each row group in Parquet format may have different dictionaries, and decoding from encoded to "dense" representation may be undesirable.
>
> * Support for dictionary "deltas" in the IPC protocol, using the isDelta flag on DictionaryBatch [1]. It's conceivable we might want to allow dictionaries to change altogether in the IPC protocol (where a dictionary id appears again -- a replacement -- but isDelta = false).
>
> * Receiving a record batch schema without the dictionaries attached (e.g. in Arrow Flight); see also the experimental patch [2].
>
> * Encoded datasets may be produced in a distributed system where "reconciling" the dictionary is not computationally feasible or desirable (particularly if the encoded data is just going to be aggregated).
>
> The reason that these are "problems" has to do with the way that we represent dictionary-encoded data in C++. We have created a "synthetic" DictionaryType object that is used for Array/Column types or for an entry in arrow::Schema. The DictionaryType wraps the index type (a signed integer type) and the dictionary itself. So from Python we have:
>
> >>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
> >>> t
> DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
> >>> t.dictionary
> <pyarrow.lib.StringArray object at 0x7eff4847fd68>
> [
>   "a",
>   "b",
>   "c",
>   "d"
> ]
> >>> t.index_type
> DataType(int8)
>
> This is useful in languages like Python because we have the notion of Categorical types, which are a combination of a static dictionary and a contiguous array of indices.
>
> It creates problems when we have changing dictionaries, because the "schema" under this in-memory construction may change from record batch to record batch. This means that types are deemed "unequal" according to many code paths we've written.
>
> To consider solutions to this problem, I want to first point out that the way we are dealing with dictionary-encoded data in memory is a purely semantic construct for C++ and the binding languages. "Dictionary" is not a data type at all according to the Arrow IPC protocol -- it is a method of transferring encoded / compressed data, and the handling thereof is left to the implementations. There are benefits to the method we are using now; in particular, it keeps dynamic dispatch (including the visitor pattern and virtual functions) based on whether something is encoded or not simple. It also leads to simple round trips of Categorical types from libraries like pandas.
>
> Here is my proposal to reconcile these issues in C++:
>
> * Add a new "synthetic" data type called "variable dictionary" to be used alongside the existing "static dictionary" type. An instance of VariableDictionaryType (name TBD) will not know what the dictionary is, only the data type of the dictionary (e.g. utf8()) and the index type (e.g. int32()).
>
> * Define a common abstract API for instances of static vs. variable dictionary arrays. Mainly this means making DictionaryArray::dictionary [3] virtual.
>
> * The _actual_ dictionary values for a particular Array must be stored somewhere and their lifetime managed. I propose to put these as a single entry in ArrayData::child_data [4].
> An alternative to this would be to modify ArrayData to have a dictionary field that would be unused except for encoded datasets.
>
> This proposal does create some ongoing implementation and maintenance burden, but to that I would make these points:
>
> * Many algorithms will dispatch from one type to the other (probably static dispatching to the variable path), so there will not be a need to implement them multiple times in most cases.
>
> * In some algorithms, we may observe a stream of dictionary-encoded arrays, and we need only obtain the current dictionary as well as the knowledge of whether it is the same as previous dictionaries. In hash aggregations and other analytics, I think we need to implement by default under the assumption of dynamic/variable dictionaries.
>
> I haven't conceived of any other ideas (after much contemplation) for how to algebraically accommodate these use cases in our object model, so I am interested in the opinions of others. As a first use case for this, I would personally be looking to address reads of encoded data from Parquet format without an intermediate pass through the dense format (which can be slow and wasteful for heavily compressed string data).
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
> [2]: https://github.com/apache/arrow/pull/4067
> [3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
> [4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208
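P.S. To illustrate the point in the quoted message about algorithms that observe a stream of dictionary-encoded arrays (needing only the current dictionary plus knowledge of whether it changed), here is a rough, purely illustrative sketch in Python. It uses pyarrow's DictionaryArray only for concreteness; a real kernel would be implemented in C++ against the proposed virtual dictionary accessor, and the helper name iter_decoded is made up for this example:

import pyarrow as pa

def iter_decoded(chunks):
    """Yield (indices, dictionary values) per chunk, refreshing cached state
    only when the dictionary actually changes between chunks."""
    current_dict = None     # dictionary observed on the previous chunk
    current_values = None   # its materialized values, rebuilt only when needed
    for chunk in chunks:    # each chunk is a pyarrow.DictionaryArray
        if current_dict is None or not chunk.dictionary.equals(current_dict):
            # The dictionary grew or changed: refresh the cached state
            current_dict = chunk.dictionary
            current_values = current_dict.to_pylist()
        yield chunk.indices.to_pylist(), current_values

# Two chunks of the same logical column, each carrying its own dictionary
chunk1 = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0], type=pa.int8()),
                                        pa.array(['a', 'b']))
chunk2 = pa.DictionaryArray.from_arrays(pa.array([2, 0], type=pa.int8()),
                                        pa.array(['a', 'b', 'c']))
for indices, values in iter_decoded([chunk1, chunk2]):
    print(indices, values)

The point is that any per-chunk work beyond reading the indices is only paid when the dictionary observed on a chunk differs from the previous one, which is the default assumption I think we should implement under.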