Nice!

On 12/13/24 4:32 PM, Calvin Dani wrote:
Hi,

Regarding the performance testing of the first query for schema inference:

We benchmarked it against contemporary methods, primarily Spark-based implementations, using a configuration of 2 node controllers and 8 data partitions.

For a GitHub dataset of 51GB:

Our approach inferred the schema in 51.6 seconds,

Spark’s native implementation took 81.6 seconds,

Methods by Spoth and Mior required 400+ seconds.

I hope this is helpful.

Regards
Calvin Dani


On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:

    Question - I think you were doing some perf testing - do you have perf
    results for these (vs. the current schema function)?

    On 12/5/24 12:04 PM, Calvin Dani wrote:
    > Hi,
    >
    > Wanted to share an update regarding the features in the APE. The two
    > queries:
    >
    > 1. query_schema()
    >
    > 2. collection_schema()
    >
    > are now functional. The query_schema() implementation has been submitted
    > for review. Once that is approved, I will proceed to submit the
    > collection_schema() query, as it depends on the first query's code.
    >
    > I would greatly appreciate your feedback, additional test cases, and any
    > thoughts you have on this APE. I’m eager to refine it further or, if it
    > seems like a solid starting point, to receive approval for this APE.
    >
    > Thank you for your time and input!
    >
    > Regards
    >
    > Calvin Dani
    >
    > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani <calvinthomas.d...@gmail.com>
    > wrote:
    >
    >> Hi,
    >>
    >> The APE has been updated with those changes!
    >>
    >> Regards
    >> Calvin Dani
    >>
    >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey <dtab...@gmail.com> wrote:
    >>
    >>> Excellent!  +1
    >>>
    >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani <calvinthomas.d...@gmail.com>
    >>> wrote:
    >>>
    >>>> Hi,
    >>>>
    >>>> Thank you for the feedback. As discussed in the last meeting, here are
    >>>> the changes incorporated into this APE:
    >>>> 1. Names of the schema inference functions
    >>>> 2. Schema inference functionality
    >>>>
    >>>> The summary of changes is as follows:
    >>>>
    >>>>     1. query_schema (aggregate function that takes all records of the
    >>>>     subquery and generates a JSON Schema)
    >>>>     2. collection_schema (JSON Schema translation of the defined
    >>>>     datatypes in the metadata node)
    >>>>     3. current_schema (for columnar stores; converts the schema inferred
    >>>>     for storage compaction to JSON Schema)
    >>>>
    >>>>
    >>>> Regards
    >>>> Calvin Dani
    >>>>
    >>>>
    >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey <dtab...@gmail.com> wrote:
    >>>>
    >>>>> Great feature!  I wasn't able to understand the query example(s),
    >>>>> though...  Could those be cleaned up a little and clarified?
    >>>>>
    >>>>> Also, I think we might want two functions at the user level - one that
    >>>>> takes an expression as input and reports its schema, and another that
    >>>>> takes a dataset/collection name as input and reports its schema.  The
    >>>>> first one would scan the results and say what the schema is; the other
    >>>>> would use a more efficient approach (accessing and combining the
    >>>>> metadata from the collection's most recent LSM components in each of
    >>>>> its partitions).
    >>>>>
    >>>>> Cheers,
    >>>>>
    >>>>> Mike
    >>>>>
    >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
    >>>>>> Initiating the discussion thread proposing a new aggregate function in
    >>>>>> AsterixDB.
    >>>>>> *Feature:* aggregate function to infer schema
    >>>>>> *Details:* This feature introduces schema inference as an SQL++ function
    >>>>>> directly integrated into AsterixDB. It is the first approach to offer
    >>>>>> schema inference as a native SQL++ function, allowing users to infer
    >>>>>> schemas not only for any dataset but also for queries and subqueries.
    >>>>>> Its output is in JSON Schema, the industry standard, and is both human-
    >>>>>> and machine-readable, suitable for user interpretation or integration
    >>>>>> into other queries or programs.
    >>>>>>
    >>>>>> array_schema() was implemented using the template of array_avg() in the
    >>>>>> Built-in Function and Function collection file. During self review, I
    >>>>>> noticed that many of the defined aggregate functions, for example
    >>>>>> SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction,
    >>>>>> are not being utilised during an array_schema() query. Is this due to
    >>>>>> different use cases, or am I utilising them incorrectly?
    >>>>>>
    >>>>>> Are there any resources to understand the functionality of aggregate
    >>>>>> functions in the implementation?
    >>>>>>
    >>>>>> *APE*
    >>>>>>
https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions
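
For readers following the thread, the idea behind query_schema (an aggregate that folds all records of a subquery into one JSON Schema) can be illustrated with a minimal Python sketch. This is not the AsterixDB implementation and is not taken from the APE; the helper names infer_type and merge, the anyOf handling of conflicting types, and the rule that a field is "required" only if every record has it are all assumptions made for illustration.

```python
from functools import reduce
from typing import Any, Iterable, Optional


def infer_type(value: Any) -> dict:
    """Return a JSON Schema fragment describing one value (hypothetical helper)."""
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # bool must be checked before int (bool subclasses int)
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        schema: dict = {"type": "array"}
        items = reduce(merge, (infer_type(v) for v in value), None)
        if items is not None:  # empty arrays carry no item information
            schema["items"] = items
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_type(v) for k, v in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"unsupported value: {value!r}")


def merge(a: Optional[dict], b: Optional[dict]) -> Optional[dict]:
    """Combine two schema fragments into one that accepts both (hypothetical helper)."""
    if a is None or b is None:
        return b if a is None else a
    if a == b:
        return a
    if a.get("type") == "object" and b.get("type") == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge(props.get(key), sub)
        # Assumption: a field is required only if every record contains it.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if a.get("type") == "array" and b.get("type") == "array":
        out: dict = {"type": "array"}
        items = merge(a.get("items"), b.get("items"))
        if items is not None:
            out["items"] = items
        return out
    # Conflicting types: record the alternatives under anyOf, without duplicates.
    alternatives: list = []
    for schema in (a, b):
        for alt in schema.get("anyOf", [schema]):
            if alt not in alternatives:
                alternatives.append(alt)
    return {"anyOf": alternatives}


def query_schema(records: Iterable[Any]) -> Optional[dict]:
    """Fold all records into a single schema, as an aggregate function would."""
    return reduce(merge, (infer_type(r) for r in records), None)
```

Because merge is associative over these fragments in the common cases, partial schemas could in principle be computed per partition and combined afterwards, which is the usual shape of a distributed aggregate; whether the real implementation works this way is not stated in the thread.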
