> people generally find use in Arrow schemas independently of concrete data.
This makes sense. I think we do want to encourage use of Arrow as a "type
system" even if there is no data involved. And, given that we cannot easily
change a field's data type property to "optional", it makes sense to use a
dedicated type, and so I would be in favor of such a proposal (we may
eventually add an "unknown type" concept in Substrait as well; it's come up
several times, and so we could use this in that context).

I think that I would still prefer a canonical extension type (with storage
type null) over a new dedicated type.

On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Ah! Well, I think this could be an interesting proposal, but someone
> should put a more formal proposal, perhaps as a draft PR.
>
> Regards
>
> Antoine.
>
>
> On 17/04/2024 at 11:57, David Li wrote:
> > For an unsupported/other extension type.
> >
> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> >> What is "this proposal"?
> >>
> >>
> >> On 17/04/2024 at 10:38, David Li wrote:
> >>> Should I take it that this proposal is dead in the water? While we
> >>> could define our own Unknown/Other type for say the ADBC PostgreSQL
> >>> driver it might be useful to have a singular type for consumers to
> >>> latch on to.
> >>>
> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
> >>>> I think an "Other" extension type is slightly different than an
> >>>> arbitrary extension type, though: the latter may be understood
> >>>> downstream but the former represents a point at which a component
> >>>> explicitly declares it does not know how to handle a field. In this
> >>>> example, the PostgreSQL ADBC driver might be able to provide a
> >>>> representation regardless, but a different driver (or say, the JDBC
> >>>> adapter, which cannot necessarily get a bytestring for an arbitrary
> >>>> JDBC type) may want an Other type to signal that it would fail if asked
> >>>> to provide particular columns.
> >>>>
> >>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
> >>>>> Depending where your Arrow-encoded data is used, either extension
> >>>>> types or generic field metadata are options. We have this problem in
> >>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
> >>>>> to an Arrow type but there are some others where we can't or don't
> >>>>> know or don't implement a conversion. Currently for these we return
> >>>>> opaque binary (the Postgres COPY representation of the value) but put
> >>>>> field metadata so that a consumer can implement a workaround for an
> >>>>> unsupported type. It would be arguably better to have implemented this
> >>>>> as an extension type; however, field metadata felt like less of a
> >>>>> commitment when I first worked on this.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> -dewey
> >>>>>
> >>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
> >>>>> <norman.jor...@improving.com.invalid> wrote:
> >>>>>>
> >>>>>> I was using UUID as an example. It looks like extension types covers
> >>>>>> my original request.
> >>>>>> ________________________________
> >>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
> >>>>>> Sent: Thursday, April 11, 2024 7:15 AM
> >>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
> >>>>>> Subject: Re: Unsupported/Other Type
> >>>>>>
> >>>>>> The OP used UUID as an example. Would that be enough or the request is for
> >>>>>> a flexible mechanism that allows the creation of one-off nominal types for
> >>>>>> very specific use-cases?
> >>>>>>
> >>>>>> --
> >>>>>> Felipe
> >>>>>>
> >>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Yes, JSON and UUID are obvious candidates for new canonical extension
> >>>>>>> types. XML also comes to mind, but I'm not sure there's much of a use
> >>>>>>> case for it.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
> >>>>>>>> In the past we have discussed adding a canonical type for UUID and JSON. I
> >>>>>>>> still think this is a good idea and could improve ergonomics in downstream
> >>>>>>>> language bindings (e.g. by exposing JSON querying function or automatically
> >>>>>>>> boxing UUIDs in built-in UUID types, like the Python uuid library). Has
> >>>>>>>> anyone done any work on this to anyone's knowledge?
> >>>>>>>>
> >>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Norman,
> >>>>>>>>> Arrow has a concept of extension types [1] along with the possibility of
> >>>>>>>>> proposing new canonical extension types [2]. This seems to cover the
> >>>>>>>>> use-cases you mention but I might be misunderstanding?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Micah
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> >>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
> >>>>>>>>> <norman.jor...@improving.com.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>>> Problem Description
> >>>>>>>>>>
> >>>>>>>>>> Currently Arrow schemas can only contain columns of types supported by
> >>>>>>>>>> Arrow. In some cases an Arrow schema maps to an external schema. This can
> >>>>>>>>>> result in the Arrow schema not being able to support all the columns from
> >>>>>>>>>> the external schema.
> >>>>>>>>>>
> >>>>>>>>>> Consider an external system that contains a column of type UUID. To model
> >>>>>>>>>> the schema in Arrow, the user has two choices:
> >>>>>>>>>>
> >>>>>>>>>> 1. Do not include the UUID column in the Arrow schema
> >>>>>>>>>>
> >>>>>>>>>> 2. Map the column to an existing Arrow type. This will not include the
> >>>>>>>>>> original type information. A UUID can be mapped to a FixedSizeBinary, but
> >>>>>>>>>> consumers of the Arrow schema will be unable to distinguish a
> >>>>>>>>>> FixedSizeBinary field from a UUID field.
> >>>>>>>>>>
> >>>>>>>>>> Possible Solution
> >>>>>>>>>>
> >>>>>>>>>> * Add a new type code that represents unsupported types
> >>>>>>>>>>
> >>>>>>>>>> * Values for the new type are represented as variable length binary
> >>>>>>>>>>
> >>>>>>>>>> Some drivers can expose data even when they don’t understand the data
> >>>>>>>>>> type. For example, the PostgreSQL driver will return the raw bytes for
> >>>>>>>>>> fields of an unknown type. Using an explicit type lets clients know that
> >>>>>>>>>> they should convert values if they were able to determine the actual data
> >>>>>>>>>> type.
> >>>>>>>>>>
> >>>>>>>>>> Questions
> >>>>>>>>>>
> >>>>>>>>>> * What is the impact on existing clients when they encounter fields of
> >>>>>>>>>> the unsupported type?
> >>>>>>>>>>
> >>>>>>>>>> * Is it safe to assume that all unsupported values can safely be
> >>>>>>>>>> converted to a variable length binary?
> >>>>>>>>>>
> >>>>>>>>>> * How can we preserve information about the original type?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>> Warning: The sender of this message could not be validated and may not
> >>>>>> be the actual sender.