Re: Unsupported/Other Type

David Li Wed, 17 Apr 2024 07:36:42 -0700

Yes, this would be for an extension type.


On Wed, Apr 17, 2024, at 23:25, Weston Pace wrote:
>> people generally find use in Arrow schemas independently of concrete data.
>
> This makes sense.  I think we do want to encourage use of Arrow as a "type
> system" even if there is no data involved.  And, given that we cannot
> easily change a field's data type property to "optional", it makes sense to
> use a dedicated type, and so I would be in favor of such a proposal (we
> may eventually add an "unknown type" concept in Substrait as well, it's
> come up several times, and so we could use this in that context).
>
> I think that I would still prefer a canonical extension type (with storage
> type null) over a new dedicated type.
>
> On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Ah! Well, I think this could be an interesting proposal, but someone
>> should put forward a more formal proposal, perhaps as a draft PR.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 17/04/2024 at 11:57, David Li wrote:
>> > For an unsupported/other extension type.
>> >
>> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>> >> What is "this proposal"?
>> >>
>> >>
>> >> On 17/04/2024 at 10:38, David Li wrote:
>> >>> Should I take it that this proposal is dead in the water? While we
>> >>> could define our own Unknown/Other type for, say, the ADBC PostgreSQL
>> >>> driver, it might be useful to have a single type for consumers to
>> >>> latch on to.
>> >>>
>> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>> >>>> I think an "Other" extension type is slightly different than an
>> >>>> arbitrary extension type, though: the latter may be understood
>> >>>> downstream but the former represents a point at which a component
>> >>>> explicitly declares it does not know how to handle a field. In this
>> >>>> example, the PostgreSQL ADBC driver might be able to provide a
>> >>>> representation regardless, but a different driver (or say, the JDBC
>> >>>> adapter, which cannot necessarily get a bytestring for an arbitrary
>> >>>> JDBC type) may want an Other type to signal that it would fail if
>> >>>> asked to provide particular columns.
>> >>>>
>> >>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>> >>>>> Depending where your Arrow-encoded data is used, either extension
>> >>>>> types or generic field metadata are options. We have this problem in
>> >>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
>> >>>>> to an Arrow type but there are some others where we can't or don't
>> >>>>> know or don't implement a conversion. Currently for these we return
>> >>>>> opaque binary (the Postgres COPY representation of the value) but put
>> >>>>> field metadata so that a consumer can implement a workaround for an
>> >>>>> unsupported type. It would arguably have been better to implement
>> >>>>> this as an extension type; however, field metadata felt like less of
>> >>>>> a commitment when I first worked on this.
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> -dewey
>> >>>>>
>> >>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>> >>>>> <norman.jor...@improving.com.invalid> wrote:
>> >>>>>>
>> >>>>>> I was using UUID as an example. It looks like extension types
>> >>>>>> cover my original request.
>> >>>>>> ________________________________
>> >>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
>> >>>>>> Sent: Thursday, April 11, 2024 7:15 AM
>> >>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>> >>>>>> Subject: Re: Unsupported/Other Type
>> >>>>>>
>> >>>>>> The OP used UUID as an example. Would that be enough, or is the
>> >>>>>> request for a flexible mechanism that allows the creation of one-off
>> >>>>>> nominal types for very specific use cases?
>> >>>>>>
>> >>>>>> —
>> >>>>>> Felipe
>> >>>>>>
>> >>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Yes, JSON and UUID are obvious candidates for new canonical
>> >>>>>>> extension types. XML also comes to mind, but I'm not sure there's
>> >>>>>>> much of a use case for it.
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>>
>> >>>>>>> Antoine.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>> >>>>>>>> In the past we have discussed adding a canonical type for UUID and
>> >>>>>>>> JSON. I still think this is a good idea and could improve ergonomics
>> >>>>>>>> in downstream language bindings (e.g. by exposing JSON querying
>> >>>>>>>> functions or automatically boxing UUIDs in built-in UUID types, like
>> >>>>>>>> the Python uuid library). Has anyone done any work on this, to
>> >>>>>>>> anyone's knowledge?
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Norman,
>> >>>>>>>>> Arrow has a concept of extension types [1] along with the
>> >>>>>>>>> possibility of proposing new canonical extension types [2].  This
>> >>>>>>>>> seems to cover the use-cases you mention but I might be
>> >>>>>>>>> misunderstanding?
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Micah
>> >>>>>>>>>
>> >>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>> >>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>> >>>>>>>>> <norman.jor...@improving.com.invalid> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Problem Description
>> >>>>>>>>>>
>> >>>>>>>>>> Currently Arrow schemas can only contain columns of types supported
>> >>>>>>>>>> by Arrow. In some cases an Arrow schema maps to an external schema.
>> >>>>>>>>>> This can result in the Arrow schema not being able to support all
>> >>>>>>>>>> the columns from the external schema.
>> >>>>>>>>>>
>> >>>>>>>>>> Consider an external system that contains a column of type UUID. To
>> >>>>>>>>>> model the schema in Arrow, the user has two choices:
>> >>>>>>>>>>
>> >>>>>>>>>>      1.  Do not include the UUID column in the Arrow schema
>> >>>>>>>>>>
>> >>>>>>>>>>      2.  Map the column to an existing Arrow type. This will not
>> >>>>>>>>>>          include the original type information. A UUID can be mapped
>> >>>>>>>>>>          to a FixedSizeBinary, but consumers of the Arrow schema will
>> >>>>>>>>>>          be unable to distinguish a FixedSizeBinary field from a UUID
>> >>>>>>>>>>          field.
>> >>>>>>>>>>
>> >>>>>>>>>> Possible Solution
>> >>>>>>>>>>
>> >>>>>>>>>>      *   Add a new type code that represents unsupported types
>> >>>>>>>>>>
>> >>>>>>>>>>      *   Values for the new type are represented as variable-length
>> >>>>>>>>>>          binary
>> >>>>>>>>>>
>> >>>>>>>>>> Some drivers can expose data even when they don’t understand the
>> >>>>>>>>>> data type. For example, the PostgreSQL driver will return the raw
>> >>>>>>>>>> bytes for fields of an unknown type. Using an explicit type lets
>> >>>>>>>>>> clients know that they should convert values if they are able to
>> >>>>>>>>>> determine the actual data type.
>> >>>>>>>>>>
>> >>>>>>>>>> Questions
>> >>>>>>>>>>
>> >>>>>>>>>>      *   What is the impact on existing clients when they encounter
>> >>>>>>>>>>          fields of the unsupported type?
>> >>>>>>>>>>
>> >>>>>>>>>>      *   Is it safe to assume that all unsupported values can be
>> >>>>>>>>>>          converted to variable-length binary?
>> >>>>>>>>>>
>> >>>>>>>>>>      *   How can we preserve information about the original type?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>>
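
To make the idea discussed in this thread concrete, here is a minimal sketch of how an "Other" extension type might be carried in field metadata and recognized by a consumer. It is only an illustration under stated assumptions, not an API from Arrow or ADBC: the `Field` class is a toy stand-in for an Arrow field, and the extension name `example.other` and the `source_system` / `source_type` payload keys are hypothetical. The only parts taken from the Arrow format are the two reserved metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`.

```python
import json
from dataclasses import dataclass, field

# Metadata keys the Arrow columnar format reserves for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name; the thread does not settle on one.
OTHER_EXT_NAME = "example.other"


@dataclass
class Field:
    """Toy stand-in for an Arrow field: name, storage type, key/value metadata."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def make_other_field(name, source_system, source_type, storage_type="null"):
    """Build a field that declares 'the producer does not support this type'.

    storage_type="null" models a schema-only placeholder (no data available);
    storage_type="binary" models a driver that can still hand over raw bytes.
    The payload keys are assumptions made for this sketch.
    """
    payload = json.dumps(
        {"source_system": source_system, "source_type": source_type}
    )
    return Field(
        name,
        storage_type,
        {EXT_NAME_KEY: OTHER_EXT_NAME, EXT_META_KEY: payload},
    )


def describe(f):
    """Show how a consumer could tell an 'Other' field from an ordinary one."""
    if f.metadata.get(EXT_NAME_KEY) != OTHER_EXT_NAME:
        return f"{f.name}: {f.storage_type}"
    info = json.loads(f.metadata[EXT_META_KEY])
    return (
        f"{f.name}: unsupported {info['source_system']} type "
        f"{info['source_type']!r} (storage: {f.storage_type})"
    )


if __name__ == "__main__":
    schema = [
        Field("id", "int64"),
        make_other_field("shape", "postgresql", "polygon"),
        make_other_field("attrs", "postgresql", "hstore", storage_type="binary"),
    ]
    for f in schema:
        print(describe(f))
```

Because the marker lives in the field's metadata rather than in the storage type, a plain binary column and an "Other" column backed by binary remain distinguishable, which is the gap the original problem description points out for UUID versus FixedSizeBinary.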
