Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Weston Pace
It also seems that two variations of the variant encoding are being discussed. The original spec, currently housed in Spark, creates a variant array in row-major order; that is, each element in the array is stored contiguously. So, if you have objects like `{"a": 7, "b": 3}` then the values f
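A minimal sketch of what "row-major" means here, assuming JSON text as a stand-in for the real variant binary encoding (it is not the Spark spec's byte layout). The helper names `encode_row_major` and `element`, and the second sample object, are hypothetical and used only for illustration. Each element's bytes occupy one contiguous range of the buffer, and an offsets list marks where each element starts and ends.

```python
import json


def encode_row_major(objects):
    """Concatenate each object's bytes back to back and record offsets."""
    buffer = bytearray()
    offsets = [0]
    for obj in objects:
        # Assumption: compact JSON stands in for the variant binary encoding.
        buffer += json.dumps(obj, separators=(",", ":")).encode("utf-8")
        offsets.append(len(buffer))
    return bytes(buffer), offsets


def element(buffer, offsets, i):
    """Decode element i from its single contiguous byte range."""
    return json.loads(buffer[offsets[i]:offsets[i + 1]])


# The second object is made up for this example.
rows = [{"a": 7, "b": 3}, {"a": 2, "b": 4}]
data, offsets = encode_row_major(rows)
first = element(data, offsets, 0)
```

In this layout, reading one whole element touches a single byte range, whereas reading field `a` across all elements has to visit every element's range in turn.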

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
On 22/08/2024 at 17:08, Curt Hagenlocher wrote: (I also happen to want a canonical Arrow representation for variant data, as this type occurs in many databases but doesn't have a great representation today in ADBC results. That's why I filed [Format] Consider adding an official variant type

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Curt Hagenlocher
This seems to straddle that line, in that you can also view this as a way to represent semi-structured data in a manner that allows for more efficient querying and computation by breaking out some of its components into a more structured form. (I also happen to want a canonical Arrow representatio

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Ah, thanks. I've tried to find a rationale and ended up on https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it a good description of what you're after? If so, then I don't think Arrow is a good match. This seems mostly to be a marshalling format for semi-structured data

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Gang Wu
Sorry for the inconvenience. This is the permalink for the discussion: https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou wrote: > Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about. Regards Antoine. Le 22/08/2

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Xuanwo
I personally believe Arrow is a better choice since we will eventually have the same memory layout but different physical layouts in Parquet, ORC, or other file formats. One concern I have about this option is whether the Arrow community is willing to make this happen and maintain this specific

Re: [DISCUSS] Variant Spec Location

2024-08-21 Thread Gang Wu
It seems that we have reached a consensus to some extent that there should be a new home for the variant spec. The pending question is whether Parquet or Arrow is a better choice. As a committer in the Arrow, Parquet and ORC communities, I am neutral on the choice and happy to help with the movement

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Micah Kornfield
> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's exi

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Will Jones
In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
+ dev@arrow Thanks for all the valuable suggestions! I am inclined toward Micah's idea that Arrow might be a better host compared to Parquet. To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for the variant type in that varia