Having dictionaries-within-dictionaries does add some complexity, but I think the use case is valid and so it would be good to determine the best way to handle this in the IPC / messaging protocol.
I would suggest: dictionaries can use other dictionaries, so long as those dictionaries occur earlier in the stream. I am not sure either the Java or C++ libraries will be able to properly handle these cases right now, but that's what we have integration tests for!

On Fri, Apr 6, 2018 at 11:59 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Brian,
>
> I would also have considered this a legitimate use of the Arrow
> specification. We only specify the DictionaryType to have a dictionary of any
> Arrow Type. In the context of Arrow's IPC this seems to be a bit more
> complicated as we seem to have the assumption that there is only one type of
> Dictionary per column. I would argue that we should be able to support this
> once we work out a reliable way to transfer them via the IPC mechanism.
>
> Just as a related thought (might not produce the result you want): In
> Parquet, only the values on the lowest level are dictionary-encoded. But this
> is also due to the fact that Parquet uses repetition and definition levels to
> encode arbitrarily nested data types. These are more space-efficient when
> they are correctly encoded but don't provide random access.
>
> Uwe
>
> On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote:
>> I've been considering a use-case with a dictionary-encoded struct
>> column, which may contain some dictionary-encoded columns itself. More
>> specifically, in this use-case each row represents a single observation
>> in a geospatial track, which includes a position, a time, and some
>> track-level metadata (track id, origin, destination, etc...). I would
>> like to represent the metadata as a dictionary-encoded struct, since
>> unique values will be repeated for each observation of that track, and I
>> would _also_ like to dictionary-encode some of the metadata column's
>> children, since unique values will typically be repeated in multiple tracks.
>>
>> I think one could make a (totally legitimate) argument that this is
>> stretching a format designed for tabular data too far. This use-case
>> could also be accomplished by breaking out the struct metadata column
>> into its own arrow table, and managing a new integer column that
>> references that table. This would look almost identical to what I
>> initially described, it just wouldn't rely on the arrow libraries to
>> manage the "dictionary".
>>
>>
>> The spec doesn't have anything to say on this topic as far as I can
>> tell, but our implementations don't currently allow a dictionary-encoded
>> column's children to be dictionary-encoded themselves [1]. Is this just
>> a simplifying assumption, or a hard rule that should be codified in the
>> spec?
>>
>> Thanks,
>> Brian
>>
>> [1]
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
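
To make the ordering rule suggested at the top of this message concrete (a dictionary may use other dictionaries only if they occur earlier in the stream), here is a rough sketch of how a reader might check it. This is not the Arrow API: `DictionaryBatch`, `depends_on`, and `check_dictionary_order` are hypothetical names invented for illustration, and the real IPC messages carry considerably more than an id.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryBatch:
    """Hypothetical stand-in for one dictionary message in a stream."""
    dict_id: int
    # IDs of other dictionaries that this dictionary's values refer to,
    # e.g. dictionary-encoded children of a dictionary-encoded struct.
    depends_on: list = field(default_factory=list)


def check_dictionary_order(stream):
    """Return the dictionary IDs in stream order, or raise ValueError if
    any dictionary refers to one that has not appeared earlier."""
    seen = set()
    order = []
    for batch in stream:
        missing = [d for d in batch.depends_on if d not in seen]
        if missing:
            raise ValueError(
                f"dictionary {batch.dict_id} refers to {missing}, "
                "which do not occur earlier in the stream"
            )
        seen.add(batch.dict_id)
        order.append(batch.dict_id)
    return order


# Assumed layout: the track-metadata struct is dictionary 0, and two of its
# children (say origin and destination) are dictionaries 1 and 2.
valid_stream = [
    DictionaryBatch(1),
    DictionaryBatch(2),
    DictionaryBatch(0, depends_on=[1, 2]),
]

# Same dictionaries, but the struct dictionary arrives before its children.
invalid_stream = [
    DictionaryBatch(0, depends_on=[1, 2]),
    DictionaryBatch(1),
    DictionaryBatch(2),
]
```

Under this rule a reader can resolve every nested dictionary in a single forward pass, without buffering messages or revisiting earlier ones. A writer would have to emit the innermost dictionaries first.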