Re: Schema Aggregate Function (VOTE)

Ian Maxon Thu, 31 Jul 2025 10:44:57 -0700

Can we propose this as a vote, like
https://lists.apache.org/thread/2k3mk471rflrnwwq64dtjhy8ydblwb92 ? I
think this APE has a similar pattern, where it went through some
discussion and revision. I think in those cases, calling for a vote is
best, rather than using the expedited process.


On Mon, Jul 28, 2025 at 3:29 PM Shiva Jahangiri
<[email protected]> wrote:
>
> If my vote counts, then +1 from me to push this change forward.
>
> On Sun, Jul 27, 2025 at 10:46 AM Mike Carey <[email protected]> wrote:
>
> > +1
> >
> > (in case I didn't already chime in with that)
> >
> > On 7/25/25 9:01 AM, Calvin Dani wrote:
> > > Hi,
> > >
> > > I’ve made the changes to the user model and added an example dataset and
> > > query for the schema inference functions.
> > > Open to any further feedback and I’d really appreciate your vote if you
> > > think it looks good!
> > >
> > > Thank you and Regards
> > > Calvin Dani
> > >
> > >
> > > On Thu, Feb 6, 2025 at 10:40 AM Mike Carey<[email protected]> wrote:
> > >
> > >> I have put comments on the wiki - some thoughts about the user model, etc.
> > >>
> > >> On 2/4/25 7:46 AM, Calvin Dani wrote:
> > >>> Hi,
> > >>>
> > >>> The APE has been updated following the implementation of Query 3,
> > >>> current_schema(), which fetches and aggregates the most recent schema
> > >>> from the LSM components.
> > >>>
> > >>> The updates include:
> > >>>
> > >>> Syntax of the new query
> > >>>
> > >>> A flowchart illustrating how the query works
> > >>>
> > >>> I’d love to hear your thoughts and suggestions! If you find it
> > >>> promising, I’d appreciate your vote.
> > >>> Thank you and Regards
> > >>> Calvin Dani
> > >>>
> > >>> On Wed, Dec 18, 2024 at 11:41 AM Mike Carey<[email protected]> wrote:
> > >>>
> > >>>>     Nice!
> > >>>>
> > >>>> On 12/13/24 4:32 PM, Calvin Dani wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> Regarding the performance testing of the first query for schema
> > >>>>> inference:
> > >>>>> We benchmarked it against contemporary methods, primarily Spark-based
> > >>>>> implementations, using a configuration of 2 node controllers and 8
> > >>>>> data partitions.
> > >>>>>
> > >>>>> For a GitHub dataset of 51GB:
> > >>>>>
> > >>>>> Our approach inferred the schema in 51.6 seconds,
> > >>>>>
> > >>>>> Spark’s native implementation took 81.6 seconds,
> > >>>>>
> > >>>>> Methods by Spoth and Mior required 400+ seconds.
> > >>>>>
> > >>>>> I hope this is helpful.
> > >>>>>
> > >>>>> Regards
> > >>>>> Calvin Dani
> > >>>>>
> > >>>>>
> > >>>>> On Thu, Dec 12, 2024 at 5:27 AM Mike Carey<[email protected]> wrote:
> > >>>>>
> > >>>>>       Question - I think you were doing some perf testing - do you have
> > >>>>>       perf results for these (vs. the current schema function)?
> > >>>>>
> > >>>>>       On 12/5/24 12:04 PM, Calvin Dani wrote:
> > >>>>>       > Hi,
> > >>>>>       >
> > >>>>>       > Wanted to share an update regarding the features in the APE. The
> > >>>>>       > two queries:
> > >>>>>       >
> > >>>>>       > 1. query_schema()
> > >>>>>       >
> > >>>>>       > 2. collection_schema()
> > >>>>>       >
> > >>>>>       > are now functional. The query_schema() implementation has been
> > >>>>>       > submitted for review. Once that is approved, I will proceed to submit
> > >>>>>       > the collection_schema() query, as it depends on the first query's
> > >>>>>       > code.
> > >>>>>       >
> > >>>>>       > I would greatly appreciate your feedback, additional test cases,
> > >>>>>       > and any thoughts you have on this APE. I’m eager to refine it further
> > >>>>>       > or, if it seems like a solid starting point, to receive approval for
> > >>>>>       > this APE.
> > >>>>>       >
> > >>>>>       > Thank you for your time and input!
> > >>>>>       >
> > >>>>>       > Regards
> > >>>>>       >
> > >>>>>       > Calvin Dani
> > >>>>>       >
> > >>>>>       > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<[email protected]> wrote:
> > >>>>>       >
> > >>>>>       >> Hi,
> > >>>>>       >>
> > >>>>>       >> The APE has been updated with those changes!
> > >>>>>       >>
> > >>>>>       >> Regards
> > >>>>>       >> Calvin Dani
> > >>>>>       >>
> > >>>>>       >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<[email protected]> wrote:
> > >>>>>       >>
> > >>>>>       >>> Excellent!  +1
> > >>>>>       >>>
> > >>>>>       >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<[email protected]> wrote:
> > >>>>>       >>>
> > >>>>>       >>>> Hi,
> > >>>>>       >>>>
> > >>>>>       >>>> Thank you for the feedback. As per the last meeting, here are the
> > >>>>>       >>>> changes that are incorporated into this APE.
> > >>>>>       >>>> They are as follows:
> > >>>>>       >>>> 1.  Name of the schema inference functions
> > >>>>>       >>>> 2. Schema inference functionality
> > >>>>>       >>>>
> > >>>>>       >>>> The summary of changes is as follows:
> > >>>>>       >>>>
> > >>>>>       >>>>     1. query_schema (Aggregate function that takes all records of the
> > >>>>>       >>>>     subquery and generates a JSON Schema),
> > >>>>>       >>>>     2. collection_schema (JSON Schema translation of the defined
> > >>>>>       >>>>     datatypes in the metadata node)
> > >>>>>       >>>>     3. current_schema (for columnar stores and converting the
> > >>>>>       >>>>     inferred schema for storage compaction to JSON Schema)
> > >>>>>       >>>>
> > >>>>>       >>>>
> > >>>>>       >>>> Regards
> > >>>>>       >>>> Calvin Dani
> > >>>>>       >>>>
> > >>>>>       >>>>
> > >>>>>       >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<[email protected]> wrote:
> > >>>>>       >>>>
> > >>>>>       >>>>> Great feature!  I wasn't able to understand the query example(s),
> > >>>>>       >>>>> though...  Could those be cleaned up a little and clarified?
> > >>>>>       >>>>>
> > >>>>>       >>>>> Also, I think we might want two functions at the user level - one
> > >>>>>       >>>>> that takes an expression as input and reports its schema, and another
> > >>>>>       >>>>> that takes a dataset/collection name as input and reports its schema.
> > >>>>>       >>>>> The first one would scan the results and say what the schema is; the
> > >>>>>       >>>>> other would use a more efficient approach (accessing and combining the
> > >>>>>       >>>>> metadata from the collection's most recent LSM components in each of
> > >>>>>       >>>>> its partitions).
> > >>>>>       >>>>>
> > >>>>>       >>>>> Cheers,
> > >>>>>       >>>>>
> > >>>>>       >>>>> Mike
> > >>>>>       >>>>>
> > >>>>>       >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> > >>>>>       >>>>>> Initiating the discussion thread proposing a new aggregate function in
> > >>>>>       >>>>>> AsterixDB.
> > >>>>>       >>>>>> *Feature:* aggregate function to infer schema
> > >>>>>       >>>>>> *Details:* This feature introduces schema inference as an SQL++ function
> > >>>>>       >>>>>> directly integrated into AsterixDB. It is the first approach to offer
> > >>>>>       >>>>>> schema inference as a native SQL++ function, allowing users to infer
> > >>>>>       >>>>>> schemas for not only any dataset but also for queries and subqueries. Its
> > >>>>>       >>>>>> output in JSON Schema, the industry standard, produces both human and
> > >>>>>       >>>>>> machine-readable results, suitable for user interpretation or integration
> > >>>>>       >>>>>> into other queries or programs.
> > >>>>>       >>>>>>
> > >>>>>       >>>>>> Utilizing the template of array_avg() in the Built-in Function and
> > >>>>>       >>>>>> Function collection file, array_schema() was implemented. During self
> > >>>>>       >>>>>> review, I noticed that many of the defined aggregate functions, for
> > >>>>>       >>>>>> example SerializableAvgAggregateFunction and
> > >>>>>       >>>>>> IntermediateAvgAggregateFunction, are not being utilised during the
> > >>>>>       >>>>>> array_schema() query. Is it due to different use cases or am I
> > >>>>>       >>>>>> utilising it incorrectly?
> > >>>>>       >>>>>>
> > >>>>>       >>>>>> Are there any resources to understand the functionality of aggregate
> > >>>>>       >>>>>> functions in the implementation?
> > >>>>>       >>>>>>
> > >>>>>       >>>>>> *APE*
> > >>>>>       >>>>>>
> > >>>>>       >>>>>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*8*3A*Schema*Inference*Aggregate*Functions__;KyUrKysr!!MLMg-p0Z!HkhB6GEhDbMZCMLBJ-D-xVzlEWogaW_K1Q-PA5k7uWP9j65SjJ5GrpCfjkcY5JrwLFniY7bed7dan7Lt$
>
>
>
> --
> Shiva Jahangiri
> Assistant Professor in Computer Science and Engineering Department
> Santa Clara University
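
For readers following the thread, the idea behind query_schema (an aggregate that takes all records of a subquery and generates a JSON Schema) can be illustrated with a small Python sketch. This is not the AsterixDB implementation; the helper names infer_schema and merge_schemas are hypothetical, and query_schema is reused here only as a label. The sketch assumes that fields missing from some records simply become optional, and that conflicting types are reported with "anyOf".

```python
from typing import Any, Iterable, Optional

Schema = dict


def infer_schema(value: Any) -> Schema:
    """Describe a single JSON-like value as a JSON Schema fragment."""
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        items = None
        for element in value:
            items = merge_schemas(items, infer_schema(element))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(val) for key, val in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"unsupported value: {value!r}")


def _variants(schema: Schema) -> list:
    """Return the alternatives of a schema (a one-element list if there is no anyOf)."""
    return schema["anyOf"] if "anyOf" in schema else [schema]


def _merge_same_type(a: Schema, b: Schema) -> Schema:
    """Merge two schema fragments that share the same "type"."""
    if a["type"] == "object":
        properties = dict(a["properties"])
        for name, sub in b["properties"].items():
            properties[name] = merge_schemas(properties.get(name), sub)
        # Assumption: a field stays required only if every record had it.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": properties, "required": required}
    if a["type"] == "array":
        items = merge_schemas(a.get("items"), b.get("items"))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    return a  # scalar types carry no further detail in this sketch


def merge_schemas(a: Optional[Schema], b: Optional[Schema]) -> Optional[Schema]:
    """Combine two partial schemas; None stands for 'nothing seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    merged: list = []
    for variant in _variants(a) + _variants(b):
        for index, existing in enumerate(merged):
            if existing["type"] == variant["type"]:
                merged[index] = _merge_same_type(existing, variant)
                break
        else:
            merged.append(variant)
    return merged[0] if len(merged) == 1 else {"anyOf": merged}


def query_schema(records: Iterable[Any]) -> Optional[Schema]:
    """Fold merge_schemas over all records, like an aggregate function would."""
    schema = None
    for record in records:
        schema = merge_schemas(schema, infer_schema(record))
    return schema
```

Because merge_schemas combines two partial schemas, the same step could in principle also combine per-partition results; whether the actual implementation does this is not stated in the thread. Note that this sketch keeps "integer" and "number" as distinct types rather than widening one into the other.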
