Yes, this would be for an extension type.
On Wed, Apr 17, 2024, at 23:25, Weston Pace wrote:
>> people generally find use in Arrow schemas independently of concrete data.
>
> This makes sense. I think we do want to encourage use of Arrow as a "type
> system" even if there is no data involved. And, given that we cannot
> easily change a field's data type property to "optional", it makes sense to
> use a dedicated type, and so I would be in favor of such a proposal (we
> may eventually add an "unknown type" concept in Substrait as well; it has
> come up several times, and so we could use this in that context).
>
> I think that I would still prefer a canonical extension type (with storage
> type null) over a new dedicated type.
>
> On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> Ah! Well, I think this could be an interesting proposal, but someone
>> should put forward a more formal proposal, perhaps as a draft PR.
>>
>> Regards
>>
>> Antoine.
>>
>> On 17/04/2024 at 11:57, David Li wrote:
>>> For an unsupported/other extension type.
>>>
>>> On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>>>> What is "this proposal"?
>>>>
>>>> On 17/04/2024 at 10:38, David Li wrote:
>>>>> Should I take it that this proposal is dead in the water? While we
>>>>> could define our own Unknown/Other type for, say, the ADBC PostgreSQL
>>>>> driver, it might be useful to have a single type for consumers to
>>>>> latch on to.
>>>>>
>>>>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>>>>>> I think an "Other" extension type is slightly different than an
>>>>>> arbitrary extension type, though: the latter may be understood
>>>>>> downstream but the former represents a point at which a component
>>>>>> explicitly declares it does not know how to handle a field.
>>>>>> In this example, the PostgreSQL ADBC driver might be able to provide a
>>>>>> representation regardless, but a different driver (or say, the JDBC
>>>>>> adapter, which cannot necessarily get a bytestring for an arbitrary
>>>>>> JDBC type) may want an Other type to signal that it would fail if asked
>>>>>> to provide particular columns.
>>>>>>
>>>>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>>>>>>> Depending on where your Arrow-encoded data is used, either extension
>>>>>>> types or generic field metadata are options. We have this problem in
>>>>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
>>>>>>> to an Arrow type but there are some others where we can't or don't
>>>>>>> know or don't implement a conversion. Currently for these we return
>>>>>>> opaque binary (the Postgres COPY representation of the value) but
>>>>>>> attach field metadata so that a consumer can implement a workaround
>>>>>>> for an unsupported type. It would arguably be better to have
>>>>>>> implemented this as an extension type; however, field metadata felt
>>>>>>> like less of a commitment when I first worked on this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> -dewey
>>>>>>>
>>>>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>>>>>>> <norman.jor...@improving.com.invalid> wrote:
>>>>>>>>
>>>>>>>> I was using UUID as an example. It looks like extension types cover
>>>>>>>> my original request.
>>>>>>>> ________________________________
>>>>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
>>>>>>>> Sent: Thursday, April 11, 2024 7:15 AM
>>>>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>>>>>>>> Subject: Re: Unsupported/Other Type
>>>>>>>>
>>>>>>>> The OP used UUID as an example. Would that be enough, or is the
>>>>>>>> request for a flexible mechanism that allows the creation of one-off
>>>>>>>> nominal types for very specific use-cases?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Felipe
>>>>>>>>
>>>>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
>>>>>>>>
>>>>>>>>> Yes, JSON and UUID are obvious candidates for new canonical extension
>>>>>>>>> types. XML also comes to mind, but I'm not sure there's much of a use
>>>>>>>>> case for it.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
>>>>>>>>>
>>>>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>>>>>>>>>> In the past we have discussed adding a canonical type for UUID and
>>>>>>>>>> JSON. I still think this is a good idea and could improve ergonomics
>>>>>>>>>> in downstream language bindings (e.g. by exposing JSON querying
>>>>>>>>>> functions or automatically boxing UUIDs in built-in UUID types, like
>>>>>>>>>> the Python uuid library). Has anyone done any work on this, to
>>>>>>>>>> anyone's knowledge?
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Norman,
>>>>>>>>>>> Arrow has a concept of extension types [1] along with the possibility
>>>>>>>>>>> of proposing new canonical extension types [2]. This seems to cover
>>>>>>>>>>> the use-cases you mention but I might be misunderstanding?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Micah
>>>>>>>>>>>
>>>>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>>>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>>>>>>>>>>> <norman.jor...@improving.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Problem Description
>>>>>>>>>>>>
>>>>>>>>>>>> Currently Arrow schemas can only contain columns of types supported
>>>>>>>>>>>> by Arrow. In some cases an Arrow schema maps to an external schema.
>>>>>>>>>>>> This can result in the Arrow schema not being able to support all the
>>>>>>>>>>>> columns from the external schema.
>>>>>>>>>>>>
>>>>>>>>>>>> Consider an external system that contains a column of type UUID. To
>>>>>>>>>>>> model the schema in Arrow, the user has two choices:
>>>>>>>>>>>>
>>>>>>>>>>>>   1. Do not include the UUID column in the Arrow schema
>>>>>>>>>>>>
>>>>>>>>>>>>   2. Map the column to an existing Arrow type. This will not include
>>>>>>>>>>>>      the original type information. A UUID can be mapped to a
>>>>>>>>>>>>      FixedSizeBinary, but consumers of the Arrow schema will be unable
>>>>>>>>>>>>      to distinguish a FixedSizeBinary field from a UUID field.
>>>>>>>>>>>>
>>>>>>>>>>>> Possible Solution
>>>>>>>>>>>>
>>>>>>>>>>>>   * Add a new type code that represents unsupported types
>>>>>>>>>>>>
>>>>>>>>>>>>   * Values for the new type are represented as variable length binary
>>>>>>>>>>>>
>>>>>>>>>>>> Some drivers can expose data even when they don’t understand the data
>>>>>>>>>>>> type. For example, the PostgreSQL driver will return the raw bytes for
>>>>>>>>>>>> fields of an unknown type. Using an explicit type lets clients know
>>>>>>>>>>>> that they should convert values if they were able to determine the
>>>>>>>>>>>> actual data type.
>>>>>>>>>>>>
>>>>>>>>>>>> Questions
>>>>>>>>>>>>
>>>>>>>>>>>>   * What is the impact on existing clients when they encounter fields
>>>>>>>>>>>>     of the unsupported type?
>>>>>>>>>>>>
>>>>>>>>>>>>   * Is it safe to assume that all unsupported values can safely be
>>>>>>>>>>>>     converted to a variable length binary?
>>>>>>>>>>>>
>>>>>>>>>>>>   * How can we preserve information about the original type?
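
For readers following the thread, here is a minimal sketch of how the extension-type approach answers the last question in the original message (preserving information about the original type). It is plain Python and deliberately does not use any Arrow library API: `Field`, `extension_field`, and `describe` are hypothetical helpers, and the extension names `example.uuid` and `example.other` and the `source_type=polygon` string are made up for illustration. The only real identifiers are the two metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`, which the Arrow columnar format reserves for extension types.

```python
from dataclasses import dataclass, field
from typing import Dict

# Metadata keys reserved by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension names, chosen only for this illustration.
UUID_EXT = "example.uuid"
OTHER_EXT = "example.other"


@dataclass
class Field:
    """Toy stand-in for an Arrow field: name, storage type, key/value metadata."""
    name: str
    storage_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


def extension_field(name: str, storage_type: str, ext_name: str,
                    ext_metadata: str = "") -> Field:
    """Build a field whose metadata marks it as an extension type."""
    return Field(name, storage_type,
                 {EXT_NAME_KEY: ext_name, EXT_META_KEY: ext_metadata})


def describe(f: Field) -> str:
    """Show what a consumer can learn about a field from its metadata."""
    ext = f.metadata.get(EXT_NAME_KEY)
    if ext is None:
        return f"{f.name}: plain {f.storage_type}"
    if ext == OTHER_EXT:
        original = f.metadata.get(EXT_META_KEY, "")
        return (f"{f.name}: unsupported source type {original!r}, "
                f"storage {f.storage_type}")
    return f"{f.name}: extension {ext} over {f.storage_type}"


schema = [
    # A genuine 16-byte binary column with no extra meaning.
    Field("payload", "fixed_size_binary[16]"),
    # Same storage type, but tagged so consumers can tell it is a UUID.
    extension_field("id", "fixed_size_binary[16]", UUID_EXT),
    # A column the producer cannot convert: null storage, original type kept.
    extension_field("shape", "null", OTHER_EXT, "source_type=polygon"),
]

descriptions = [describe(f) for f in schema]
for line in descriptions:
    print(line)
```

The first two fields share a storage type and differ only in metadata, which is what lets a consumer tell a UUID from an arbitrary FixedSizeBinary. The third field follows the suggestion in the thread of an "Other" type with null storage, so a schema can still name a column whose values the producer cannot supply.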