Understanding possible synergies between arrow & zarr communities?

2024-07-08 Thread Carl Boettiger
Hi folks, Neal Richardson suggested on the rOpenSci slack I might pose this question to this list. As an observer to both communities, I'm interested in if there is or might be more communication between the Pangeo community's focus on Zarr serialization with what the Arrow team has done with Par

Re: Arrow board report due July 17

2024-07-08 Thread Andy Grove
Thank you for the updates so far. However, we still need to add content for the following subprojects. The report is due this Wednesday, July 10 (the board meeting is on July 17). - Arrow Flight SQL adapter for PostgreSQL - Nanoarrow - C++ - Dataset & Parquet -

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jeremy Leibs
On Mon, Jul 8, 2024 at 3:57 PM Weston Pace wrote: > > user-facing API documentation someone would need to practically form > and/or > > process data when integrating a library into their code. > > If we are thinking API contract / programmatic access then I'd offer yet > another alternative. At

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jeremy Leibs
Thanks for the links. That's very helpful context. It's a shame the flatbuffer <-> json conversion isn't more widely available, though I do see the complexity now. It sounds like our best path forward for now will be to generate a pair of assets for each of our types: - A binary fbs-encoded IPC

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Aldrin
Based on the response to using an empty IPC stream/file, it sounds to me like something substrait-like is ideal. Maybe an interface that can go between the equivalent of relational schemas and (generated) arrow code as you have shown. Then, there could be straightforward integration points with

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Weston Pace
> but it doesn't address questions of the kind of > user-facing API documentation someone would need to practically form and/or > process data when integrating a library into their code. Agreed that IPC / flatbuffers / proto are not useful here. JSON might help and YAML would be more pleasantly c

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Ian Cook
This has come up a few times in the past [1][2]. The main concern has been about cross-version compatibility guarantees. [1] https://github.com/apache/arrow/issues/25078 [2] https://lists.apache.org/thread/02p37yxksxccsqfn9l6j4ryno404ttnl On Mon, Jul 8, 2024 at 3:10 PM Lee, David (PAG) wrote: >

RE: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Lee, David (PAG)
Gah found a bug with my code.. Here's a corrected python version.. # iterate through possible nested columns def _convert_to_arrow_type(field, obj): """ :param field: :param obj: :returns: pyarrow datatype """ if isinstance(obj, list): for child_obj in obj:

RE: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Lee, David (PAG)
I came up with my own json representation that I could put into json / yaml config files with some python code to convert this into a pyarrow schema object.. - yaml flat example- fields: cusip: string start_date: date32 end_date: date32 purpose: string source:
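A minimal sketch of the idea described in this message, assuming the flat YAML example has already been parsed into a Python dict. Only the field names and the `string` / `date32` type names come from the message. The helper names, the whitelist of type names, and the lazy `pyarrow` import are illustrative assumptions, not the author's actual code.

```python
# Hypothetical helpers: validate a flat {"fields": {name: type_name}} mapping,
# then (optionally) turn it into a pyarrow schema.

# Assumed whitelist of type names that map to parameterless pyarrow factories
# such as pa.string() and pa.date32().
ALLOWED_TYPES = {"string", "date32", "int64", "float64"}


def parse_flat_fields(config):
    """Return ordered (name, type_name) pairs from a flat field mapping."""
    fields = config.get("fields")
    if not isinstance(fields, dict):
        raise ValueError("config must contain a 'fields' mapping")
    pairs = []
    for name, type_name in fields.items():
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {type_name!r} for field {name!r}")
        pairs.append((name, type_name))
    return pairs


def to_pyarrow_schema(pairs):
    """Build a pyarrow.Schema from (name, type_name) pairs."""
    # Imported lazily so the parsing step above works without pyarrow installed.
    import pyarrow as pa

    return pa.schema(
        [pa.field(name, getattr(pa, type_name)()) for name, type_name in pairs]
    )


# Stand-in for the parsed YAML document from the flat example.
config = {
    "fields": {
        "cusip": "string",
        "start_date": "date32",
        "end_date": "date32",
        "purpose": "string",
    }
}
pairs = parse_flat_fields(config)
```

With pyarrow installed, `to_pyarrow_schema(pairs)` would produce the schema object. Nested or parameterized types (lists, structs, timestamps with units) would need a richer mapping than this flat whitelist.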

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jorge Cardoso Leitão
Hi, So, something like a human and computer readable standard for arrow schemas, e.g. via yaml or a json schema. We kind of do this in our integration tests / golden tests, where we have a non-official json representation of an arrow schema. The ask here is to standardize such a format in some

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jeremy Leibs
That handles questions of machine-to-machine coordination, and lets me do things like validation, but it doesn't address questions of the kind of user-facing API documentation someone would need to practically form and/or process data when integrating a library into their code. I want to be able

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Weston Pace
+1 for empty stream/file as schema serialization. I have used this approach myself on more than one occasion and it works well. It can even be useful for transmitting schemas between different arrow-native libraries in the same language (e.g. rust->rust) since it allows the different libraries to

Re: [DISCUSS] Approach to generic schema representation

2024-07-08 Thread Matt Topol
Hey Jeremy, Currently the first message of an IPC stream is a Schema message, which consists solely of a flatbuffer message defined in the Schema.fbs file of the Arrow repo. All of the libraries that can read Arrow IPC should be able to also handle converting a single IPC schema message back in

[DISCUSS] Approach to generic schema representation

2024-07-08 Thread Jeremy Leibs
I'm looking for any advice folks may have on a generic way to document and represent expected arrow schemas as part of an interface definition. For context, our library provides a cross-language (python, c++, rust) SDK for logging semantic multi-modal data (point clouds, images, geometric transfor