> people generally find use in Arrow schemas independently of concrete data.

This makes sense. I think we do want to encourage use of Arrow as a "type
system" even when no data is involved. And since we cannot easily make a
field's data type property "optional", it makes sense to use a dedicated
type, so I would be in favor of such a proposal. (We may eventually add an
"unknown type" concept to Substrait as well; it has come up several times,
and we could use this type in that context.)

I think that I would still prefer a canonical extension type (with storage
type null) over a new dedicated type.

On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Ah! Well, I think this could be an interesting proposal, but someone
> should put forward a more formal proposal, perhaps as a draft PR.
>
> Regards
>
> Antoine.
>
>
> On 17/04/2024 at 11:57, David Li wrote:
> > For an unsupported/other extension type.
> >
> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> >> What is "this proposal"?
> >>
> >>
> >> On 17/04/2024 at 10:38, David Li wrote:
> >>> Should I take it that this proposal is dead in the water? While we
> >>> could define our own Unknown/Other type for say the ADBC PostgreSQL
> >>> driver it might be useful to have a singular type for consumers to
> >>> latch on to.
> >>>
> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
> >>>> I think an "Other" extension type is slightly different than an
> >>>> arbitrary extension type, though: the latter may be understood
> >>>> downstream but the former represents a point at which a component
> >>>> explicitly declares it does not know how to handle a field. In this
> >>>> example, the PostgreSQL ADBC driver might be able to provide a
> >>>> representation regardless, but a different driver (or say, the JDBC
> >>>> adapter, which cannot necessarily get a bytestring for an arbitrary
> >>>> JDBC type) may want an Other type to signal that it would fail if
> >>>> asked to provide particular columns.
> >>>>
> >>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
> >>>>> Depending where your Arrow-encoded data is used, either extension
> >>>>> types or generic field metadata are options. We have this problem in
> >>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
> >>>>> to an Arrow type but there are some others where we can't or don't
> >>>>> know or don't implement a conversion. Currently for these we return
> >>>>> opaque binary (the Postgres COPY representation of the value) but put
> >>>>> field metadata so that a consumer can implement a workaround for an
> >>>>> unsupported type. It would be arguably better to have implemented
> >>>>> this as an extension type; however, field metadata felt like less
> >>>>> of a commitment when I first worked on this.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> -dewey
> >>>>>
> >>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
> >>>>> <norman.jor...@improving.com.invalid> wrote:
> >>>>>>
> >>>>>> I was using UUID as an example. It looks like extension types
> >>>>>> cover my original request.
> >>>>>> ________________________________
> >>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
> >>>>>> Sent: Thursday, April 11, 2024 7:15 AM
> >>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
> >>>>>> Subject: Re: Unsupported/Other Type
> >>>>>>
> >>>>>> The OP used UUID as an example. Would that be enough, or is the
> >>>>>> request for a flexible mechanism that allows the creation of
> >>>>>> one-off nominal types for very specific use-cases?
> >>>>>>
> >>>>>> —
> >>>>>> Felipe
> >>>>>>
> >>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Yes, JSON and UUID are obvious candidates for new canonical
> >>>>>>> extension types. XML also comes to mind, but I'm not sure there's
> >>>>>>> much of a use case for it.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
> >>>>>>>> In the past we have discussed adding a canonical type for UUID
> >>>>>>>> and JSON. I still think this is a good idea and could improve
> >>>>>>>> ergonomics in downstream language bindings (e.g. by exposing JSON
> >>>>>>>> querying functions or automatically boxing UUIDs in built-in UUID
> >>>>>>>> types, like the Python uuid library). Has anyone done any work on
> >>>>>>>> this to anyone's knowledge?
> >>>>>>>>
> >>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield
> >>>>>>>> <emkornfi...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Norman,
> >>>>>>>>> Arrow has a concept of extension types [1] along with the
> >>>>>>>>> possibility of proposing new canonical extension types [2]. This
> >>>>>>>>> seems to cover the use-cases you mention but I might be
> >>>>>>>>> misunderstanding?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Micah
> >>>>>>>>>
> >>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> >>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
> >>>>>>>>> <norman.jor...@improving.com.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>>> Problem Description
> >>>>>>>>>>
> >>>>>>>>>> Currently Arrow schemas can only contain columns of types
> >>>>>>>>>> supported by Arrow. In some cases an Arrow schema maps to an
> >>>>>>>>>> external schema. This can result in the Arrow schema not being
> >>>>>>>>>> able to support all the columns from the external schema.
> >>>>>>>>>>
> >>>>>>>>>> Consider an external system that contains a column of type UUID.
> >>>>>>>>>> To model the schema in Arrow, the user has two choices:
> >>>>>>>>>>
> >>>>>>>>>>      1.  Do not include the UUID column in the Arrow schema
> >>>>>>>>>>
> >>>>>>>>>>      2.  Map the column to an existing Arrow type. This will not
> >>>>>>>>>>          include the original type information. A UUID can be
> >>>>>>>>>>          mapped to a FixedSizeBinary, but consumers of the Arrow
> >>>>>>>>>>          schema will be unable to distinguish a FixedSizeBinary
> >>>>>>>>>>          field from a UUID field.
> >>>>>>>>>>
> >>>>>>>>>> Possible Solution
> >>>>>>>>>>
> >>>>>>>>>>      *   Add a new type code that represents unsupported types
> >>>>>>>>>>
> >>>>>>>>>>      *   Values for the new type are represented as variable
> >>>>>>>>>>          length binary
> >>>>>>>>>>
> >>>>>>>>>> Some drivers can expose data even when they don’t understand the
> >>>>>>>>>> data type. For example, the PostgreSQL driver will return the raw
> >>>>>>>>>> bytes for fields of an unknown type. Using an explicit type lets
> >>>>>>>>>> clients know that they should convert values if they are able to
> >>>>>>>>>> determine the actual data type.
> >>>>>>>>>>
> >>>>>>>>>> Questions
> >>>>>>>>>>
> >>>>>>>>>>      *   What is the impact on existing clients when they
> >>>>>>>>>>          encounter fields of the unsupported type?
> >>>>>>>>>>
> >>>>>>>>>>      *   Is it safe to assume that all unsupported values can
> >>>>>>>>>>          safely be converted to a variable length binary?
> >>>>>>>>>>
> >>>>>>>>>>      *   How can we preserve information about the original type?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
>
