+1
(in case I didn't already chime in with that)
On 7/25/25 9:01 AM, Calvin Dani wrote:
Hi,
I’ve made the changes to the user model and added an example dataset and
query for the schema inference functions.
I'm open to any further feedback, and I'd really appreciate your vote if
you think it looks good!
Thank you and Regards
Calvin Dani
On Thu, Feb 6, 2025 at 10:40 AM Mike Carey <dtab...@gmail.com> wrote:
I have put comments on the wiki - some thoughts about the user model, etc.
On 2/4/25 7:46 AM, Calvin Dani wrote:
Hi,
The APE has been updated following the implementation of Query 3,
current_schema(), which fetches and aggregates the most recent schema
from the LSM components.
The updates include:
- The syntax of the new query (a rough usage sketch follows below)
- A flowchart illustrating how the query works
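
To give a rough idea of the intended usage (the exact syntax is spelled
out in the APE; the call below is only a sketch that assumes
current_schema takes the collection name as a string, and the
SocialMedia.Reviews collection is made up for illustration):

    USE SocialMedia;
    -- Aggregate the most recent schema information from the LSM
    -- components of each partition of the Reviews collection.
    SELECT VALUE current_schema("Reviews");
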
I’d love to hear your thoughts and suggestions! If you find it promising,
I’d appreciate your vote.
Thank you and Regards
Calvin Dani
On Wed, Dec 18, 2024 at 11:41 AM Mike Carey <dtab...@gmail.com> wrote:
Nice!
On 12/13/24 4:32 PM, Calvin Dani wrote:
Hi,
Regarding the performance testing of the first query for schema
inference: we benchmarked it against contemporary methods, primarily
Spark-based implementations, using a configuration of 2 node controllers
and 8 data partitions.
For a 51 GB GitHub dataset:
- Our approach inferred the schema in 51.6 seconds,
- Spark's native implementation took 81.6 seconds,
- The methods by Spoth and Mior required 400+ seconds.
I hope this is helpful.
Regards
Calvin Dani
On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:
Question - I think you were doing some perf testing - do you have perf
results for these (vs. the current schema function)?
On 12/5/24 12:04 PM, Calvin Dani wrote:
> Hi,
>
> Wanted to share an update regarding the features in the APE. The two
> queries:
>
> 1. query_schema()
>
> 2. collection_schema()
>
> are now functional. The query_schema() implementation has been
> submitted for review. Once that is approved, I will proceed to submit
> the collection_schema() query, as it depends on the first query's code.
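>
> To make the intent concrete, a rough sketch of the two calls (the exact
> syntax is defined in the APE; the Tweets dataset and the lang field are
> made up for illustration):
>
>     -- Infer the schema of an arbitrary query/subquery by scanning the
>     -- records it returns:
>     SELECT VALUE query_schema((SELECT VALUE t FROM Tweets t WHERE t.lang = "en"));
>
>     -- Translate the collection's datatype declared in the metadata into
>     -- JSON Schema (collection name passed as a string):
>     SELECT VALUE collection_schema("Tweets");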
>
> I would greatly appreciate your feedback, additional test cases, and
> any thoughts you have on this APE. I'm eager to refine it further or,
> if it seems like a solid starting point, to receive approval for this
> APE.
>
> Thank you for your time and input!
>
> Regards
>
> Calvin Dani
>
> On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani <calvinthomas.d...@gmail.com>
> wrote:
>
>> Hi,
>>
>> The APE has been updated with those changes!
>>
>> Regards
>> Calvin Dani
>>
>> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey <dtab...@gmail.com> wrote:
>>
>>> Excellent! +1
>>>
>>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani <calvinthomas.d...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you for the feedback. As per the last meeting, here are the
>>>> changes that have been incorporated into this APE.
>>>> They are as follows:
>>>> 1. Name of the schema inference functions
>>>> 2. Schema inference functionality
>>>>
>>>> The summary of changes is as follows:
>>>>
>>>> 1. query_schema (an aggregate function that takes all records of the
>>>>    subquery and generates a JSON Schema),
>>>> 2. collection_schema (a JSON Schema translation of the datatypes
>>>>    defined in the metadata node),
>>>> 3. current_schema (for columnar stores; converts the schema inferred
>>>>    for storage compaction to JSON Schema)
>>>>
>>>>
>>>> Regards
>>>> Calvin Dani
>>>>
>>>>
>>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey <dtab...@gmail.com> wrote:
>>>>
>>>>> Great feature! I wasn't able to understand the query example(s),
>>>>> though... Could those be cleaned up a little and clarified?
>>>>>
>>>>> Also, I think we might want two functions at the user level - one
>>>>> that takes an expression as input and reports its schema, and
>>>>> another that takes a dataset/collection name as input and reports
>>>>> its schema. The first one would scan the results and say what the
>>>>> schema is; the other would use a more efficient approach (accessing
>>>>> and combining the metadata from the collection's most recent LSM
>>>>> components in each of its partitions).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike
>>>>>
>>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
>>>>>> Initiating the discussion thread proposing a new aggregate function
>>>>>> in AsterixDB.
>>>>>> *Feature:* aggregate function to infer schema
>>>>>> *Details:* This feature introduces schema inference as an SQL++
>>>>>> function directly integrated into AsterixDB. It is the first
>>>>>> approach to offer schema inference as a native SQL++ function,
>>>>>> allowing users to infer schemas not only for any dataset but also
>>>>>> for queries and subqueries. Its output in JSON Schema, the industry
>>>>>> standard, is both human- and machine-readable, suitable for user
>>>>>> interpretation or integration into other queries or programs.
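>>>>>>
>>>>>> As a purely illustrative sketch (the collection, fields, and exact
>>>>>> call syntax below are made up; the APE has the real details), a
>>>>>> call such as
>>>>>>
>>>>>>     SELECT VALUE array_schema((SELECT VALUE u FROM Users u));
>>>>>>
>>>>>> would produce output shaped like standard JSON Schema, e.g.:
>>>>>>
>>>>>>     { "type": "object",
>>>>>>       "properties": { "id": { "type": "integer" },
>>>>>>                       "name": { "type": "string" } } }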
>>>>>>
>>>>>> array_schema() was implemented using array_avg() as a template
>>>>>> from the Built-in Function and Function collection file. During
>>>>>> self-review, I noticed that many of the defined aggregate
>>>>>> functions, for example SerializableAvgAggregateFunction and
>>>>>> IntermediateAvgAggregateFunction, are not being utilised during
>>>>>> an array_schema() query. Is that due to different use cases, or
>>>>>> am I utilising them incorrectly?
>>>>>>
>>>>>> Are there any resources to understand the functionality of
>>>>>> aggregate functions in the implementation?
>>>>>>
>>>>>> *APE*
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions