I opened https://issues.apache.org/jira/browse/ARROW-8079 about the
Python question

On Wed, Mar 11, 2020 at 2:53 PM Neal Richardson
<neal.p.richard...@gmail.com> wrote:
>
> While the underlying storage may allow duplicate keys, it seems much more
> likely that someone would end up with duplicate keys by accident than by
> design. And although it may be up to the implementations to determine or
> enforce uniqueness constraints, it might be a good idea to make a
> project-level statement about this (a "should" rather than "must"
> specification).
>
> As for the integration test JSON format, "a list of key/value pairs like
> {"key": $key, "value": $value}" does make sense, not so much because we
> should support duplicated keys but because we need to preserve order, as
> Wes pointed out.
>
> Neal
>
> On Wed, Mar 11, 2020 at 12:36 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Wed, Mar 11, 2020 at 2:22 PM Antoine Pitrou <solip...@pitrou.net>
> > wrote:
> > >
> > > On Wed, 11 Mar 2020 12:44:26 -0500
> > > Wes McKinney <wesmck...@gmail.com> wrote:
> > > > On this note, in Python we should probably re-evaluate the data
> > > > structure returned when accessing the "metadata" field.
> > >
> > > I think it's ok for the convenience API to return a dict, if we also
> > > expose e.g. a "metadata_items" that returns an iterable of key/value
> > > pairs.
> >
> > The alternative would be a dict-like object that implements some of
> > the dict APIs. Might also be useful to expose KeyValueMetadata::Merge
> > in Python
> >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > >
> > > > On Wed, Mar 11, 2020 at 12:42 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > > >
> > > > > In the C++ library at least, uniqueness is never asserted when
> > reading
> > > > > and writing the IPC metadata [1] [2]. If you use
> > > > > KeyValueMetadata::FindKey and the keys are non-unique, it will return
> > > > > the first one it finds. KeyValueMetadata::Merge assumes uniqueness,
> > > > > and the KeyValueMetadata::ToUnorderedMap function will drop all but
> > > > > one duplicate.
> > > > >
> > > > > In Parquet, the metadata is also a list of KeyValue pairs with no
> > > > > qualifications [3]
> > > > >
> > > > > My weak preference is to leave it to applications to make assertions
> > > > > about uniqueness. In either case since the metadata is ordered in the
> > > > > integration tests it would make sense to serialize as a list of
> > > > > key/value pairs like {"key": $key, "value": $value}
> > > > >
> > > > > [1]:
> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L463
> > > > > [2]:
> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L471
> > > > > [3]:
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > > > >
> > > > > On Wed, Mar 11, 2020 at 12:11 PM Ben Kietzman <
> > ben.kietz...@rstudio.com> wrote:
> > > > > >
> > > > > > While working on https://issues.apache.org/jira/browse/ARROW-2255
> > > > > > (serialize custom_metadata in the integration tests), we had the
> > following
> > > > > > discussion on GitHub:
> > > > > >
> > https://github.com/apache/arrow/pull/6556#pullrequestreview-372405940
> > > > > >
> > > > > > In short, although in Schema.fbs custom_metadata is declared as an
> > array of
> > > > > > KeyValue pairs (so duplicate keys would be possible), all reference
> > > > > > implementations assume it to represent an associative map with
> > unique keys.
> > > > > >
> > > > > > Is there a use case for duplicate metadata keys? It seems that an
> > > > > > acceptable resolution might be to note in Schema.fbs that
> > implementations
> > > > > > are allowed to assume that keys are unique
> > > > > >
> > > > > > Ben
> > > >
> > >
> > >
> > >
> >

Reply via email to