Hi Andrew,

Thanks for the reply! I did consider exactly that first: handling it only at the application level. Unfortunately that is a no-go for our migration to Arrow (from our own type system), because it removes many of the benefits, such as the built-in CSV writer, Parquet support, and a bunch of other things that we would then need to implement on our own. It would also create a suboptimal experience (worse than the one we have today, hence we couldn't migrate) for us and for anyone building CloudQuery plugins with our SDK.
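To make the application-level problem concrete, here is a minimal sketch that uses only the Go standard library, not the Arrow API. The names `uuidValue`, `uuidToBinary`, `marshalAsStorage` and `marshalAsUUID` are hypothetical and exist only for this illustration. It shows that once a value is handed over as raw binary storage, a generic JSON encoder can only emit base64, while the logical type knows its canonical text form.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// uuidValue is a hypothetical 16-byte UUID type used only for this illustration.
type uuidValue [16]byte

// String formats the UUID in the canonical 8-4-4-4-12 hex layout.
func (u uuidValue) String() string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// uuidToBinary is the kind of helper a caller is forced to know about
// when the storage layout (raw bytes) leaks through the abstraction.
func uuidToBinary(u uuidValue) []byte {
	return u[:]
}

// marshalAsStorage encodes the raw storage bytes. encoding/json renders
// a []byte as a base64 string, which is what a generic writer would do.
func marshalAsStorage(u uuidValue) string {
	b, err := json.Marshal(uuidToBinary(u))
	if err != nil {
		panic(err)
	}
	return string(b)
}

// marshalAsUUID encodes the logical value in its canonical string form.
func marshalAsUUID(u uuidValue) string {
	b, err := json.Marshal(u.String())
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	u := uuidValue{0x12, 0x3e, 0x45, 0x67, 0xe8, 0x9b, 0x12, 0xd3,
		0xa4, 0x56, 0x42, 0x66, 0x14, 0x17, 0x40, 0x00}
	fmt.Println("storage view:", marshalAsStorage(u)) // base64 text
	fmt.Println("logical view:", marshalAsUUID(u))    // "123e4567-e89b-12d3-a456-426614174000"
}
```

The second form is what we would like the built-in writers to produce, without every user having to re-implement them.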

I have already created a PR <https://github.com/apache/arrow/pull/34454> for the Go implementation, with an example of how we intend to use it <https://github.com/cloudquery/filetypes/tree/main/internal/cqarrow>, and already got great feedback from Matt Topol. Any more feedback and ideas are welcome. If this abstraction works well, I think other languages might benefit from it too (though right now we only use Go).

On Mon, Mar 6, 2023 at 2:08 PM Andrew Lamb <al...@influxdata.com> wrote:

> Hi Yevgeny,
>
> It is great you are thinking of using Arrow.
>
>
> - The problems are around the abstraction for the extension types. While I understand that the underlying storage needs to be supported in the library, we don't have a way for extensions to provide their own builder, which means the user needs to know how the extension type stores the type inside the binary. This creates a leaky abstraction and the need for various helper functions like `UUIDToBinary`
>
> I don't have anything specific to offer in terms of the Go implementation.
>
> However, in terms of helping define a better abstraction, one way you might proceed is to forgo using the library support for extension types and implement support for your custom types yourself in your application code. Once you have figured out the most useful APIs, then perhaps you could propose contributing them to the arrow Go implementation.
>
> Andrew
>
> On Fri, Mar 3, 2023 at 5:54 AM Yevgeny Pats <y...@cloudquery.io> wrote:
>
> > Hey folks,
> >
> > Hopefully this is the right place to ask. As some background, I'm Yevgeny Pats <https://www.linkedin.com/in/yevgeny-pats-5973328b/>, Founder @ CloudQuery <https://github.com/cloudquery/cloudquery>. We are very interested in migrating our protocol and Go type system to Apache Arrow.
> > Extensions are a critical part for us, and thus I have the following questions on whether it's a usage problem on my end or something that is not yet available. I'll give an example for Go here, but I believe the same issue exists in all libraries/languages.
> >
> > Here is a public GitHub gist <https://gist.github.com/yevgenypats/6969e8e598161fc2021612c780bba3eb>.
> >
> > What are the problems:
> >
> > - The problems are around the abstraction for the extension types. While I understand that the underlying storage needs to be supported in the library, we don't have a way for extensions to provide their own builder, which means the user needs to know how the extension type stores the type inside the binary. This creates a leaky abstraction and the need for various helper functions like `UUIDToBinary`.
> > - The other way is fine, as you can have methods like ToUUID on top of the extension array. But this creates asymmetry in the abstraction.
> > - Because we don't control the builder for extensions, this creeps into other places like JSON <https://github.com/apache/arrow/issues/34292#issuecomment-1446653210> and CSV, where we can't control marshalling (in the same way we control all other built-in types). So basically, for extensions that use a binary type as the underlying storage, in JSON and CSV those will always be encoded as base64, which is not very useful (think about UUID, IP address, MAC address).
> >
> > The main point is that I think the right abstraction for extensions should provide all the APIs (type, array, builder) just like built-in types; otherwise the abstraction is incomplete or "leaky". Of course we can still have limitations, like the custom builder must use an underlying known storage (for it to work over IPC), but it can still control various other aspects like marshaling, unmarshaling, building, and so on.
> >
> > Hopefully this gives enough context, but I would love to elaborate.
> >
> > Thanks,
> > Yevgeny
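
For readers of the thread, the "type, array, builder" symmetry asked for above could look roughly like the sketch below. This is not the Arrow Go API and not the code in the PR; `ipBuilder`, `ipArray` and `buildAndMarshal` are hypothetical names, and a plain `[][]byte` stands in for a real binary storage builder. The sketch assumes IPv4 only. The point is that callers append and read logical values (`net.IP`) and that the array itself controls its JSON form, while the storage stays plain bytes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net"
)

// ipBuilder is a hypothetical extension builder: callers append logical
// values and never touch the binary storage directly.
type ipBuilder struct {
	storage [][]byte // stand-in for an underlying binary storage builder
}

// Append validates the logical value and writes its 4-byte storage form.
func (b *ipBuilder) Append(ip net.IP) error {
	v4 := ip.To4()
	if v4 == nil {
		return errors.New("ipBuilder: only IPv4 is handled in this sketch")
	}
	b.storage = append(b.storage, append([]byte(nil), v4...))
	return nil
}

// NewArray hands the accumulated storage to an array and resets the builder.
func (b *ipBuilder) NewArray() *ipArray {
	arr := &ipArray{storage: b.storage}
	b.storage = nil
	return arr
}

// ipArray is the matching hypothetical extension array.
type ipArray struct {
	storage [][]byte
}

// Len reports the number of stored values.
func (a *ipArray) Len() int { return len(a.storage) }

// Value is the read-side counterpart of Append: storage bytes back to net.IP.
func (a *ipArray) Value(i int) net.IP { return net.IP(a.storage[i]) }

// MarshalJSON lets the extension decide its own text form instead of base64.
func (a *ipArray) MarshalJSON() ([]byte, error) {
	out := make([]string, a.Len())
	for i := range out {
		out[i] = a.Value(i).String()
	}
	return json.Marshal(out)
}

// buildAndMarshal parses the inputs, builds an array, and encodes it as JSON.
func buildAndMarshal(values []string) (string, error) {
	b := &ipBuilder{}
	for _, s := range values {
		if err := b.Append(net.ParseIP(s)); err != nil {
			return "", err
		}
	}
	out, err := json.Marshal(b.NewArray())
	return string(out), err
}

func main() {
	out, err := buildAndMarshal([]string{"10.0.0.1", "192.168.1.20"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // ["10.0.0.1","192.168.1.20"]
}
```

With this shape there is no need for a separate "to binary" helper on the write side, which is the asymmetry described in the second bullet.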