Re: Schema Aggregate Function (VOTE)

Shiva Jahangiri Mon, 28 Jul 2025 15:30:11 -0700

If my vote counts, then +1 from me to push this change forward.

On Sun, Jul 27, 2025 at 10:46 AM Mike Carey <[email protected]> wrote:


> +1
>
> (in case I didn't already chime in with that)
>
> On 7/25/25 9:01 AM, Calvin Dani wrote:
> > Hi,
> >
> > I’ve made the changes to the user model and added an example dataset and
> > query for the schema inference functions.
> > Open to any further feedback and I’d really appreciate your vote if you
> > think it looks good!
> >
> > Thank you and Regards
> > Calvin Dani
> >
> >
> > On Thu, Feb 6, 2025 at 10:40 AM Mike Carey<[email protected]> wrote:
> >
> >> I have put comments on the wiki - some thoughts about the user model, etc.
> >>
> >> On 2/4/25 7:46 AM, Calvin Dani wrote:
> >>> Hi,
> >>>
> >>> The APE has been updated following the implementation of Query 3,
> >>> current_schema(), which fetches and aggregates the most recent schema
> >>> from the LSM components.
> >>>
> >>> The updates include:
> >>>
> >>> Syntax of the new query
> >>>
> >>> A flowchart illustrating how the query works
> >>>
> >>> I’d love to hear your thoughts and suggestions! If you find it promising,
> >>> I’d appreciate your vote.
> >>> Thank you and Regards
> >>> Calvin Dani
> >>>
> >>> On Wed, Dec 18, 2024 at 11:41 AM Mike Carey<[email protected]> wrote:
> >>>
> >>>>     Nice!
> >>>>
> >>>> On 12/13/24 4:32 PM, Calvin Dani wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Regarding the performance testing of the first query for schema inference:
> >>>>> We benchmarked it against contemporary methods, primarily Spark-based
> >>>>> implementations, using a configuration of 2 node controllers and 8
> >>>>> data partitions.
> >>>>>
> >>>>> For a GitHub dataset of 51GB:
> >>>>>
> >>>>> Our approach inferred the schema in 51.6 seconds,
> >>>>>
> >>>>> Spark’s native implementation took 81.6 seconds,
> >>>>>
> >>>>> Methods by Spoth and Mior required 400+ seconds.
> >>>>>
> >>>>> I hope this is helpful.
> >>>>>
> >>>>> Regards
> >>>>> Calvin Dani
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 12, 2024 at 5:27 AM Mike Carey<[email protected]> wrote:
> >>>>>
> >>>>>       Question - I think you were doing some perf testing - do you have
> >>>>>       perf results for these (vs. the current schema function)?
> >>>>>
> >>>>>       On 12/5/24 12:04 PM, Calvin Dani wrote:
> >>>>>       > Hi,
> >>>>>       >
> >>>>>       > Wanted to share an update regarding the features in the APE. The two
> >>>>>       > queries:
> >>>>>       >
> >>>>>       > 1. query_schema()
> >>>>>       >
> >>>>>       > 2. collection_schema()
> >>>>>       >
> >>>>>       > are now functional. The query_schema() implementation has been
> >>>>>       > submitted for review. Once that is approved, I will proceed to submit
> >>>>>       > the collection_schema() query, as it depends on the first query's code.
> >>>>>       >
> >>>>>       > I would greatly appreciate your feedback, additional test cases, and
> >>>>>       > any thoughts you have on this APE. I’m eager to refine it further or,
> >>>>>       > if it seems like a solid starting point, to receive approval for this
> >>>>>       > APE.
> >>>>>       >
> >>>>>       > Thank you for your time and input!
> >>>>>       >
> >>>>>       > Regards
> >>>>>       >
> >>>>>       > Calvin Dani
> >>>>>       >
> >>>>>       > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<[email protected]> wrote:
> >>>>>       >
> >>>>>       >> Hi,
> >>>>>       >>
> >>>>>       >> The APE has been updated with those changes!
> >>>>>       >>
> >>>>>       >> Regards
> >>>>>       >> Calvin Dani
> >>>>>       >>
> >>>>>       >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<[email protected]> wrote:
> >>>>>       >>
> >>>>>       >>> Excellent!  +1
> >>>>>       >>>
> >>>>>       >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<[email protected]> wrote:
> >>>>>       >>>
> >>>>>       >>>> Hi,
> >>>>>       >>>>
> >>>>>       >>>> Thank you for the feedback. As per the last meeting, the
> >>>>>       >>>> following changes have been incorporated into this APE:
> >>>>>       >>>> 1.  Name of the schema inference functions
> >>>>>       >>>> 2. Schema inference functionality
> >>>>>       >>>>
> >>>>>       >>>> The summary of changes is as follows:
> >>>>>       >>>>
> >>>>>       >>>>     1. query_schema (aggregate function that takes all records
> >>>>>       >>>>     of the subquery and generates a JSON Schema)
> >>>>>       >>>>     2. collection_schema (JSON Schema translation of the defined
> >>>>>       >>>>     datatypes in the metadata node)
> >>>>>       >>>>     3. current_schema (for columnar stores; converts the inferred
> >>>>>       >>>>     schema for storage compaction to JSON Schema)
> >>>>>       >>>>
> >>>>>       >>>>
> >>>>>       >>>> Regards
> >>>>>       >>>> Calvin Dani
> >>>>>       >>>>
> >>>>>       >>>>
> >>>>>       >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<[email protected]> wrote:
> >>>>>       >>>>
> >>>>>       >>>>> Great feature!  I wasn't able to understand the query example(s),
> >>>>>       >>>>> though...  Could those be cleaned up a little and clarified?
> >>>>>       >>>>>
> >>>>>       >>>>> Also, I think we might want two functions at the user level - one
> >>>>>       >>>>> that takes an expression as input and reports its schema, and another
> >>>>>       >>>>> that takes a dataset/collection name as input and reports its schema.
> >>>>>       >>>>> The first one would scan the results and say what the schema is; the
> >>>>>       >>>>> other would use a more efficient approach (accessing and combining
> >>>>>       >>>>> the metadata from the collection's most recent LSM components in each
> >>>>>       >>>>> of its partitions).
> >>>>>       >>>>>
> >>>>>       >>>>> Cheers,
> >>>>>       >>>>>
> >>>>>       >>>>> Mike
> >>>>>       >>>>>
> >>>>>       >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> >>>>>       >>>>>> Initiating the discussion thread proposing a new aggregate function
> >>>>>       >>>>>> in AsterixDB.
> >>>>>       >>>>>> *Feature:* aggregate function to infer schema
> >>>>>       >>>>>> *Details:* This feature introduces schema inference as an SQL++
> >>>>>       >>>>>> function directly integrated into AsterixDB. It is the first approach
> >>>>>       >>>>>> to offer schema inference as a native SQL++ function, allowing users
> >>>>>       >>>>>> to infer schemas for not only any dataset but also for queries and
> >>>>>       >>>>>> subqueries. Its output in JSON Schema, the industry standard, produces
> >>>>>       >>>>>> both human and machine-readable results, suitable for user
> >>>>>       >>>>>> interpretation or integration into other queries or programs.
> >>>>>       >>>>>>
> >>>>>       >>>>>> Using the template of array_avg() in the Built-in Function and
> >>>>>       >>>>>> Function collection file, array_schema() was implemented. During
> >>>>>       >>>>>> self-review, I noticed that many of the defined aggregate functions,
> >>>>>       >>>>>> for example SerializableAvgAggregateFunction and
> >>>>>       >>>>>> IntermediateAvgAggregateFunction, are not being utilised during an
> >>>>>       >>>>>> array_schema() query. Is it due to different use cases, or am I
> >>>>>       >>>>>> utilising it incorrectly?
> >>>>>       >>>>>>
> >>>>>       >>>>>> Are there any resources to understand the functionality of aggregate
> >>>>>       >>>>>> functions in the implementation?
> >>>>>       >>>>>>
> >>>>>       >>>>>> *APE*
> >>>>>       >>>>>>
> >>>>>       >>>>>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*8*3A*Schema*Inference*Aggregate*Functions__;KyUrKysr!!MLMg-p0Z!HkhB6GEhDbMZCMLBJ-D-xVzlEWogaW_K1Q-PA5k7uWP9j65SjJ5GrpCfjkcY5JrwLFniY7bed7dan7Lt$



-- 
Shiva Jahangiri
Assistant Professor in Computer Science and Engineering Department
Santa Clara University
