Hello Brian,

I would also consider this a legitimate use of the Arrow specification.
The spec only says that a DictionaryType has a dictionary, which may be of
any Arrow type. In the context of Arrow's IPC this is a bit more complicated,
as we seem to assume that there is only one type of dictionary per column. I
would argue that we should be able to support nested dictionaries once we
work out a reliable way to transfer them via the IPC mechanism.
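
To make the layout under discussion concrete, here is a minimal pure-Python
sketch of a dictionary-encoded struct whose children are dictionary-encoded
again. It does not use the Arrow libraries and is not how any implementation
stores the data; the helper `dictionary_encode`, the function `lookup`, and
the sample track values are all made up for illustration.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), with unique values in first-seen order."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical track-level metadata, one entry per observation:
# (track id, origin, destination)
observations = [
    ("t1", "SEA", "SFO"),
    ("t1", "SEA", "SFO"),
    ("t2", "SFO", "SEA"),
    ("t2", "SFO", "SEA"),
    ("t3", "SEA", "LAX"),
]

# Outer level: each row stores an index into a dictionary of unique structs.
struct_dict, row_indices = dictionary_encode(observations)

# Inner level: child columns of the struct dictionary are encoded again,
# because the same origin/destination shows up in several tracks.
track_ids = [s[0] for s in struct_dict]
origin_dict, origin_indices = dictionary_encode([s[1] for s in struct_dict])
dest_dict, dest_indices = dictionary_encode([s[2] for s in struct_dict])


def lookup(row):
    """Resolve one observation by following both levels of indices."""
    i = row_indices[row]
    return (track_ids[i], origin_dict[origin_indices[i]], dest_dict[dest_indices[i]])
```

Resolving a row needs two index hops but no scan, so random access is
preserved at both levels.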

Just as a related thought (it might not produce the result you want): in
Parquet, only the values at the lowest level are dictionary-encoded. This is
also due to the fact that Parquet uses repetition and definition levels to
encode arbitrarily nested data types. These are more space-efficient when
they are encoded correctly, but they don't provide random access.
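
As a rough illustration of the random-access point, here is a simplified
Python sketch of definition levels for a nullable struct with one nullable
child. It assumes no repeated fields, so repetition levels are omitted, and it
ignores Parquet's actual on-disk encoding of the levels; the names `encode`
and `read_row` and the field name `b` are hypothetical.

```python
MAX_DEF = 2  # level 2: both the struct and its child "b" are present


def encode(rows):
    """rows holds None, {"b": None}, or {"b": value}.

    Returns one definition level per row, plus only the non-null leaf values.
    """
    def_levels, values = [], []
    for row in rows:
        if row is None:
            def_levels.append(0)      # the struct itself is null
        elif row["b"] is None:
            def_levels.append(1)      # struct present, child null
        else:
            def_levels.append(MAX_DEF)
            values.append(row["b"])   # nulls take no slot in the values
    return def_levels, values


def read_row(def_levels, values, i):
    """Reconstruct row i; the value position requires scanning earlier levels."""
    level = def_levels[i]
    if level == 0:
        return None
    if level == 1:
        return {"b": None}
    # No random access: count the fully defined rows that come before row i.
    pos = sum(1 for d in def_levels[:i] if d == MAX_DEF)
    return {"b": values[pos]}


rows = [{"b": 7}, None, {"b": None}, {"b": 9}]
def_levels, values = encode(rows)
```

Nulls cost only a level entry here, which is where the space saving comes
from, but locating the value for row i depends on every level before it.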

Uwe

On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote:
> I've been considering a use-case with a dictionary-encoded struct 
> column, which may contain some dictionary-encoded columns itself. More 
> specifically, in this use-case each row represents a single observation 
> in a geospatial track, which includes a position, a time, and some 
> track-level metadata (track id, origin, destination, etc...). I would 
> like to represent the metadata as a dictionary-encoded struct, since 
> unique values will be repeated for each observation of that track, and I 
> would _also_ like to dictionary-encode some of the metadata column's 
> children, since unique values will typically be repeated in multiple tracks.
> 
> I think one could make a (totally legitimate) argument that this is 
> stretching a format designed for tabular data too far. This use-case 
> could also be accomplished by breaking out the struct metadata column 
> into its own arrow table, and managing a new integer column that 
> references that table. This would look almost identical to what I 
> initially described, it just wouldn't rely on the arrow libraries to 
> manage the "dictionary".
> 
> 
> The spec doesn't have anything to say on this topic as far as I can 
> tell, but our implementations don't currently allow a dictionary-encoded 
> column's children to be dictionary-encoded themselves [1]. Is this just 
> a simplifying assumption, or a hard rule that should be codified in the 
> spec?
> 
> Thanks,
> Brian
> 
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
