Should I take it that this proposal is dead in the water? While we could define our own Unknown/Other type for, say, the ADBC PostgreSQL driver, it might be useful to have a single type for consumers to latch on to.
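To make the idea concrete, here is a rough sketch of how one shared type could let a consumer detect such fields. It is only an illustration under assumptions: `Field` is a plain-Python stand-in for an Arrow field rather than any Arrow library class, and the extension name `example.other` and the `vendor_name`/`type_name` keys are hypothetical. The only part taken from the Arrow format is that extension types travel as the field metadata keys `ARROW:extension:name` and `ARROW:extension:metadata`.

```python
# Sketch only: plain-Python stand-in for an Arrow field, no Arrow library used.
import json
from dataclasses import dataclass, field

# Hypothetical extension name, not a registered canonical extension type.
OTHER_EXTENSION_NAME = "example.other"


@dataclass(frozen=True)
class Field:
    """Stand-in for an Arrow field: a name, a storage type, and string metadata."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def make_other_field(name, vendor_name, type_name):
    """Producer side: declare that the driver could not map this column."""
    # Hypothetical payload recording where the type came from.
    ext_metadata = json.dumps({"vendor_name": vendor_name, "type_name": type_name})
    return Field(name, "binary", {
        "ARROW:extension:name": OTHER_EXTENSION_NAME,
        "ARROW:extension:metadata": ext_metadata,
    })


def describe_other(f):
    """Consumer side: return (vendor_name, type_name) for an "Other" field, else None."""
    if f.metadata.get("ARROW:extension:name") != OTHER_EXTENSION_NAME:
        return None
    info = json.loads(f.metadata["ARROW:extension:metadata"])
    return info["vendor_name"], info["type_name"]


geom = make_other_field("geom", "PostgreSQL", "geometry")
plain = Field("payload", "binary")
print(describe_other(geom))   # ('PostgreSQL', 'geometry')
print(describe_other(plain))  # None
```

With one agreed name, a consumer needs a single check like `describe_other` rather than one per driver, which is the main thing a shared type would buy us.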
On Fri, Apr 12, 2024, at 07:32, David Li wrote:
> I think an "Other" extension type is slightly different than an
> arbitrary extension type, though: the latter may be understood
> downstream but the former represents a point at which a component
> explicitly declares it does not know how to handle a field. In this
> example, the PostgreSQL ADBC driver might be able to provide a
> representation regardless, but a different driver (or say, the JDBC
> adapter, which cannot necessarily get a bytestring for an arbitrary
> JDBC type) may want an Other type to signal that it would fail if asked
> to provide particular columns.
>
> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>> Depending where your Arrow-encoded data is used, either extension
>> types or generic field metadata are options. We have this problem in
>> the ADBC Postgres driver, where we can convert *most* Postgres types
>> to an Arrow type but there are some others where we can't or don't
>> know or don't implement a conversion. Currently for these we return
>> opaque binary (the Postgres COPY representation of the value) but put
>> field metadata so that a consumer can implement a workaround for an
>> unsupported type. It would be arguably better to have implemented this
>> as an extension type; however, field metadata felt like less of a
>> commitment when I first worked on this.
>>
>> Cheers,
>>
>> -dewey
>>
>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>> <norman.jor...@improving.com.invalid> wrote:
>>>
>>> I was using UUID as an example. It looks like extension types covers my
>>> original request.
>>> ________________________________
>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
>>> Sent: Thursday, April 11, 2024 7:15 AM
>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>>> Subject: Re: Unsupported/Other Type
>>>
>>> The OP used UUID as an example. Would that be enough or the request is for
>>> a flexible mechanism that allows the creation of one-off nominal types for
>>> very specific use-cases?
>>>
>>> --
>>> Felipe
>>>
>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
>>>
>>> >
>>> > Yes, JSON and UUID are obvious candidates for new canonical extension
>>> > types. XML also comes to mind, but I'm not sure there's much of a use
>>> > case for it.
>>> >
>>> > Regards
>>> >
>>> > Antoine.
>>> >
>>> >
>>> > On 10/04/2024 at 22:55, Wes McKinney wrote:
>>> > > In the past we have discussed adding a canonical type for UUID and JSON. I
>>> > > still think this is a good idea and could improve ergonomics in downstream
>>> > > language bindings (e.g. by exposing JSON querying function or automatically
>>> > > boxing UUIDs in built-in UUID types, like the Python uuid library). Has
>>> > > anyone done any work on this to anyone's knowledge?
>>> > >
>>> > > On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Norman,
>>> > >> Arrow has a concept of extension types [1] along with the possibility of
>>> > >> proposing new canonical extension types [2]. This seems to cover the
>>> > >> use-cases you mention but I might be misunderstanding?
>>> > >>
>>> > >> Thanks,
>>> > >> Micah
>>> > >>
>>> > >> [1]
>>> > >> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>> > >> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>> > >>
>>> > >> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>>> > >> <norman.jor...@improving.com.invalid> wrote:
>>> > >>
>>> > >>> Problem Description
>>> > >>>
>>> > >>> Currently Arrow schemas can only contain columns of types supported by
>>> > >>> Arrow. In some cases an Arrow schema maps to an external schema. This can
>>> > >>> result in the Arrow schema not being able to support all the columns from
>>> > >>> the external schema.
>>> > >>>
>>> > >>> Consider an external system that contains a column of type UUID. To model
>>> > >>> the schema in Arrow, the user has two choices:
>>> > >>>
>>> > >>> 1. Do not include the UUID column in the Arrow schema
>>> > >>>
>>> > >>> 2. Map the column to an existing Arrow type. This will not include the
>>> > >>> original type information. A UUID can be mapped to a FixedSizeBinary, but
>>> > >>> consumers of the Arrow schema will be unable to distinguish a
>>> > >>> FixedSizeBinary field from a UUID field.
>>> > >>>
>>> > >>> Possible Solution
>>> > >>>
>>> > >>> * Add a new type code that represents unsupported types
>>> > >>>
>>> > >>> * Values for the new type are represented as variable length binary
>>> > >>>
>>> > >>> Some drivers can expose data even when they don’t understand the data
>>> > >>> type. For example, the PostgreSQL driver will return the raw bytes for
>>> > >>> fields of an unknown type. Using an explicit type lets clients know that
>>> > >>> they should convert values if they were able to determine the actual data
>>> > >>> type.
>>> > >>>
>>> > >>> Questions
>>> > >>>
>>> > >>> * What is the impact on existing clients when they encounter fields of
>>> > >>> the unsupported type?
>>> > >>>
>>> > >>> * Is it safe to assume that all unsupported values can safely be
>>> > >>> converted to a variable length binary?
>>> > >>>
>>> > >>> * How can we preserve information about the original type?
>>> > >>>
>>> Warning: The sender of this message could not be validated and may not be
>>> the actual sender.