Hi Wes,

Thanks for the detailed writeup; I think this is an important problem to
solve.  I spent some time thinking about this while working on ARROW-3769 and
came to a similar conclusion: the current dictionary type is limiting when
doing partial reads of Parquet files.

I'm not sure if this makes sense, but have you considered expanding the
DictionaryType to incorporate both the variable and static states?  I think
modelling these different states with the existing data type might be simpler
than introducing another data type.  The implementation could be a simple
"is_static" flag passed as part of the constructor, and this would still
allow carrying around the currently known dictionary entries for "variable
dictionaries" through the existing accessor.  I'd imagine that variable
dictionaries could compare equal so long as the data types of the index and
dictionary are equal.
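
To make this concrete, here is a rough sketch of what I have in mind.  The
names and signatures are hypothetical (this is not the current
arrow::DictionaryType definition), just a way to show the flag and the relaxed
equality rule for variable dictionaries:

    // Hypothetical sketch only -- not the existing arrow::DictionaryType.
    #include <memory>
    #include <utility>
    #include "arrow/api.h"

    namespace sketch {

    class DictionaryType {
     public:
      // The new is_static flag defaults to true to keep today's behavior.
      DictionaryType(std::shared_ptr<arrow::DataType> index_type,
                     std::shared_ptr<arrow::Array> dictionary,
                     bool ordered = false, bool is_static = true)
          : index_type_(std::move(index_type)),
            dictionary_(std::move(dictionary)),
            ordered_(ordered),
            is_static_(is_static) {}

      bool is_static() const { return is_static_; }

      // Currently known entries; for a variable dictionary these may grow
      // from one record batch to the next.
      std::shared_ptr<arrow::Array> dictionary() const { return dictionary_; }

      // Variable dictionaries compare equal when the index type and the
      // dictionary value type match, regardless of the entries themselves.
      bool Equals(const DictionaryType& other) const {
        if (is_static_ != other.is_static_ ||
            !index_type_->Equals(other.index_type_)) {
          return false;
        }
        if (!is_static_) {
          return dictionary_->type()->Equals(other.dictionary_->type());
        }
        return ordered_ == other.ordered_ &&
               dictionary_->Equals(other.dictionary_);
      }

     private:
      std::shared_ptr<arrow::DataType> index_type_;
      std::shared_ptr<arrow::Array> dictionary_;
      bool ordered_;
      bool is_static_;
    };

    }  // namespace sketch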

There are obvious backwards-compatibility implications to such a change, but I
don't know whether Arrow even makes such a guarantee.  It could be a dangerous
change to make if clients have already written code that assumes
DictionaryTypes are always static, but I see some simplicity in extending the
meaning of the existing data type.

I'd be happy to take a stab at a draft PR for this if the idea sounds
promising, or if it needs more fleshing out before people can form an opinion.

Thanks,

Hatem


On 4/29/19, 7:10 PM, "Wes McKinney" <[email protected]> wrote:

    hi all,
    
    There have been many discussions in passing on various issues and JIRA
    tickets over the last months and years about how to manage
    dictionary-encoded columnar arrays in-memory in C++. Here's a list of
    some problems we have encountered:
    
    * Dictionaries that may differ from one record batch to another, but
    semantically represent parts of the same dataset. For example, each
    row group in Parquet format may have different dictionaries, and
    decoding from encoded to "dense" representation may be undesirable
    
    * Support for dictionary "deltas" in the IPC protocol, using the
    isDelta flag on DictionaryBatch [1]. It's conceivable we might want to
    allow dictionaries to change altogether in the IPC protocol (where a
    dictionary id appears again -- a replacement -- but isDelta = false)
    
    * Receiving a record batch schema without the dictionaries attached
    (e.g. in Arrow Flight), see also experimental patch [2]
    
    * Encoded datasets may be produced in a distributed system where
    "reconciling" the dictionary is not computationally feasible or
    desirable (particularly if the encoded data is just going to be
    aggregated)
    
    The reason that these are "problems" has to do with the way that we
    represent dictionary encoded data in C++. We have created a
    "synthetic" DictionaryType object that is used for Array/Column types
    or for an entry in arrow::Schema. The DictionaryType wraps the index
    type (signed integer type) and the dictionary itself. So from Python
    we have
    
    >>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
    >>> t
    DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
    >>> t.dictionary
    <pyarrow.lib.StringArray object at 0x7eff4847fd68>
    [
      "a",
      "b",
      "c",
      "d"
    ]
    >>> t.index_type
    DataType(int8)
    
    This is useful in languages like Python because we have the notion of
    Categorical types which are a combination of a static dictionary and a
    contiguous array of indices.
    
    It creates problems when we have changing dictionaries, because the
    "schema" under this in-memory construction may change from record
    batch to record batch. This means that types are deemed "unequal"
    according to many code paths we've written.
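    
    To illustrate, here is a minimal C++ sketch against the current API
    (error handling omitted; arrow::dictionary(index_type, dictionary_array)
    is the factory described above): two batches of the same logical column
    whose dictionaries differ end up with types that compare unequal.
    
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>
    #include "arrow/api.h"
    
    int main() {
      auto make_dict = [](const std::vector<std::string>& values) {
        arrow::StringBuilder builder;
        for (const auto& v : values) {
          (void)builder.Append(v);  // ignore statuses for brevity
        }
        std::shared_ptr<arrow::Array> out;
        (void)builder.Finish(&out);
        return out;
      };
    
      // Same logical column; the second batch saw one extra value.
      auto t1 = arrow::dictionary(arrow::int8(),
                                  make_dict({"a", "b", "c"}));
      auto t2 = arrow::dictionary(arrow::int8(),
                                  make_dict({"a", "b", "c", "d"}));
    
      std::cout << t1->Equals(t2) << std::endl;  // prints 0: "unequal" types
      return 0;
    }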
    
    To consider solutions to this problem, I want to first point out that
    the way we are dealing with dictionary-encoded data in memory is a
    purely semantic construct for C++ and the binding languages.
    "Dictionary" is not a data type as all according to the Arrow IPC
    protocol -- it is a method is transferring encoded / compressed data,
    and the handling thereof is left to the implementations. There are
    benefits to the method we are using now, in particular it makes
    dynamic dispatch (including the visitor pattern, and virtual
    functions) based on whether something is encoded or not simple. It
    also leads to simple round trips of Categorical types from libraries
    like pandas.
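    
    For example, a kernel can dispatch on encoded vs. dense input through
    the existing visitor machinery (a small illustrative sketch only):
    
    #include <iostream>
    #include "arrow/api.h"
    #include "arrow/visitor.h"
    
    class EncodedAwareVisitor : public arrow::ArrayVisitor {
     public:
      arrow::Status Visit(const arrow::DictionaryArray& array) override {
        std::cout << "encoded input, dictionary size = "
                  << array.dictionary()->length() << std::endl;
        return arrow::Status::OK();
      }
      arrow::Status Visit(const arrow::StringArray& array) override {
        std::cout << "dense string input, length = " << array.length()
                  << std::endl;
        return arrow::Status::OK();
      }
    };
    
    // Usage: array->Accept(&visitor) calls the overload matching the
    // concrete array type.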
    
    Here is my proposal to reconcile these issues in C++ (a rough sketch
    follows the list below):
    
    * Add a new "synthetic" data type called "variable dictionary" to be
    used alongside the existing "static dictionary" type. An instance of
    VariableDictionaryType (name TBD) will not know what the dictionary
    is, only the data type of the dictionary (e.g. utf8()) and the index
    type (e.g. int32())
    
    * Define common abstract API for instances of static vs variable
    dictionary arrays. Mainly this means making
    DictionaryArray::dictionary [3] virtual
    
    * The _actual_ dictionary values for a particular Array must be stored
    somewhere and lifetime managed. I propose to put these as a single
    entry in ArrayData::child_data [4]. An alternative to this would be to
    modify ArrayData to have a dictionary field that would be unused
    except for encoded datasets
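    
    A rough sketch of what these pieces might look like (names TBD as noted
    above; none of this is existing Arrow API):
    
    #include <memory>
    #include <utility>
    #include "arrow/api.h"
    
    namespace sketch {
    
    // Knows the index and value types, but never the dictionary values.
    class VariableDictionaryType {
     public:
      VariableDictionaryType(std::shared_ptr<arrow::DataType> index_type,
                             std::shared_ptr<arrow::DataType> value_type)
          : index_type_(std::move(index_type)),
            value_type_(std::move(value_type)) {}
    
      // Equality depends only on the index and value types, so schemas stay
      // equal from record batch to record batch even if dictionaries change.
      bool Equals(const VariableDictionaryType& other) const {
        return index_type_->Equals(other.index_type_) &&
               value_type_->Equals(other.value_type_);
      }
    
     private:
      std::shared_ptr<arrow::DataType> index_type_;
      std::shared_ptr<arrow::DataType> value_type_;
    };
    
    // Common accessor shared by static and variable encoded arrays. For the
    // variable case, the values could be carried per-Array, e.g. as a single
    // entry in ArrayData::child_data, so each array manages the lifetime of
    // its own dictionary.
    class EncodedArray {
     public:
      virtual ~EncodedArray() = default;
      virtual std::shared_ptr<arrow::Array> dictionary() const = 0;
    };
    
    }  // namespace sketch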
    
    This proposal does create some ongoing implementation and maintenance
    burden, but to that I would make these points:
    
    * Many algorithms will dispatch from one type to the other (probably
    static dispatching to the variable path), so there will not be a need
    to implement multiple times in most cases
    
    * In some algorithms, we may observe a stream of dictionary-encoded
    arrays, and we need only obtain the current dictionary along with
    knowledge of whether it is the same as previous dictionaries. In hash
    aggregations and other analytics I think we need to implement by
    default under the assumption of dynamic/variable dictionaries (a rough
    sketch of this pattern follows below)
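    
    Sketched out, that consumption pattern might look like the following
    (assuming the per-array dictionary() accessor from the proposal; the
    helper function is hypothetical, not existing API):
    
    #include <memory>
    #include <vector>
    #include "arrow/api.h"
    
    void ConsumeEncodedStream(
        const std::vector<std::shared_ptr<arrow::DictionaryArray>>& stream) {
      std::shared_ptr<arrow::Array> prev_dictionary;
      for (const auto& chunk : stream) {
        std::shared_ptr<arrow::Array> dict = chunk->dictionary();
        // Cheap identity check first, then fall back to a value comparison.
        bool same = prev_dictionary != nullptr &&
                    (dict.get() == prev_dictionary.get() ||
                     dict->Equals(*prev_dictionary));
        if (!same) {
          // First batch, or the dictionary changed: rebuild per-dictionary
          // state, e.g. the hash table used by an aggregation.
        }
        // ... aggregate over chunk->indices() against that state ...
        prev_dictionary = dict;
      }
    }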
    
    I haven't conceived (after much contemplation) of any other way to
    algebraically accommodate these use cases in our object model, so I am
    interested in the opinions of others. As a first use case for this I
    would personally be looking to address reads of encoded data from
    Parquet format without an intermediate pass through dense format
    (which can be slow and wasteful for heavily compressed string data)
    
    Thanks,
    Wes
    
    [1]: 
https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
    [2]: https://github.com/apache/arrow/pull/4067
    [3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
    [4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208
    
