If my vote counts, then +1 from me to push this change forward.

On Sun, Jul 27, 2025 at 10:46 AM Mike Carey <dtab...@gmail.com> wrote:
> +1
>
> (in case I didn't already chime in with that)
>
> On 7/25/25 9:01 AM, Calvin Dani wrote:
> > Hi,
> >
> > I’ve made the changes to the user model and added an example dataset and
> > query for the schema inference functions.
> > Open to any further feedback and I’d really appreciate your vote if you
> > think it looks good!
> >
> > Thank you and Regards
> > Calvin Dani
> >
> >
> > On Thu, Feb 6, 2025 at 10:40 AM Mike Carey<dtab...@gmail.com> wrote:
> >
> >> I have put comments on the wiki - some thoughts about the user model, etc.
> >>
> >> On 2/4/25 7:46 AM, Calvin Dani wrote:
> >>> Hi,
> >>>
> >>> The APE has been updated following the implementation of Query 3,
> >>> current_schema(), which fetches and aggregates the most recent schema from
> >>> the LSM components.
> >>>
> >>> The updates include:
> >>>
> >>> Syntax of the new query
> >>>
> >>> A flowchart illustrating how the query works
> >>>
> >>> I’d love to hear your thoughts and suggestions! If you find it promising,
> >>> I’d appreciate your vote.
> >>> Thank you and Regards
> >>> Calvin Dani
> >>>
> >>> On Wed, Dec 18, 2024 at 11:41 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>
> >>>> Nice!
> >>>>
> >>>> On 12/13/24 4:32 PM, Calvin Dani wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Regarding the performance testing of the first query for schema inference:
> >>>>> We benchmarked it against contemporary methods, primarily Spark-based
> >>>>> implementations, using a configuration of 2 node controllers and 8
> >>>>> data partitions.
> >>>>>
> >>>>> For a GitHub dataset of 51GB:
> >>>>>
> >>>>> Our approach inferred the schema in 51.6 seconds,
> >>>>>
> >>>>> Spark’s native implementation took 81.6 seconds,
> >>>>>
> >>>>> Methods by Spoth and Mior required 400+ seconds.
> >>>>>
> >>>>> I hope this is helpful.
> >>>>>
> >>>>> Regards
> >>>>> Calvin Dani
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 12, 2024 at 5:27 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>>>
> >>>>> Question - I think you were doing some perf testing - do you have perf
> >>>>> results for these (vs. the current schema function)?
> >>>>>
> >>>>> On 12/5/24 12:04 PM, Calvin Dani wrote:
> >>>>> > Hi,
> >>>>> >
> >>>>> > Wanted to share an update regarding the features in the APE. The two
> >>>>> > queries:
> >>>>> >
> >>>>> > 1. query_schema()
> >>>>> >
> >>>>> > 2. collection_schema()
> >>>>> >
> >>>>> > are now functional. The query_schema() implementation has been submitted
> >>>>> > for review. Once that is approved, I will proceed to submit the
> >>>>> > collection_schema() query, as it depends on the first query's code.
> >>>>> >
> >>>>> > I would greatly appreciate your feedback, additional test cases, and any
> >>>>> > thoughts you have on this APE. I’m eager to refine it further or, if it
> >>>>> > seems like a solid starting point, to receive approval for this APE.
> >>>>> >
> >>>>> > Thank you for your time and input!
> >>>>> >
> >>>>> > Regards
> >>>>> >
> >>>>> > Calvin Dani
> >>>>> >
> >>>>> > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<calvinthomas.d...@gmail.com>
> >>>>> > wrote:
> >>>>> >
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> The APE has been updated with those changes!
> >>>>> >>
> >>>>> >> Regards
> >>>>> >> Calvin Dani
> >>>>> >>
> >>>>> >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>>> >>
> >>>>> >>> Excellent! +1
> >>>>> >>>
> >>>>> >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<calvinthomas.d...@gmail.com>
> >>>>> >>> wrote:
> >>>>> >>>
> >>>>> >>>> Hi,
> >>>>> >>>>
> >>>>> >>>> Thank you for the feedback. As per the last meeting, here are the
> >>>>> >>>> changes that are incorporated into this APE.
> >>>>> >>>> They are as follows:
> >>>>> >>>> 1.
Name of the schema inference functions
> >>>>> >>>> 2. Schema inference functionality
> >>>>> >>>>
> >>>>> >>>> The summary of changes is as follows:
> >>>>> >>>>
> >>>>> >>>> 1. query_schema (Aggregate function that takes all records of the
> >>>>> >>>> subquery and generates a JSON Schema),
> >>>>> >>>> 2. collection_schema (JSON Schema translation of the defined datatypes
> >>>>> >>>> in the metadata node)
> >>>>> >>>> 3. current_schema (for columnar stores and converting the inferred
> >>>>> >>>> schema for storage compaction to JSON Schema)
> >>>>> >>>>
> >>>>> >>>>
> >>>>> >>>> Regards
> >>>>> >>>> Calvin Dani
> >>>>> >>>>
> >>>>> >>>>
> >>>>> >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>>> >>>>
> >>>>> >>>>> Great feature! I wasn't able to understand the query example(s),
> >>>>> >>>>> though... Could those be cleaned up a little and clarified?
> >>>>> >>>>>
> >>>>> >>>>> Also, I think we might want two functions at the user level - one that
> >>>>> >>>>> takes an expression as input and reports its schema, and another that
> >>>>> >>>>> takes a dataset/collection name as input and reports its schema. The
> >>>>> >>>>> first one would scan the results and say what the schema is; the other
> >>>>> >>>>> would use a more efficient approach (accessing and combining the
> >>>>> >>>>> metadata from the collection's most recent LSM components in each of its
> >>>>> >>>>> partitions).
> >>>>> >>>>>
> >>>>> >>>>> Cheers,
> >>>>> >>>>>
> >>>>> >>>>> Mike
> >>>>> >>>>>
> >>>>> >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> >>>>> >>>>>> Initiating the discussion thread proposing a new aggregate function in
> >>>>> >>>>>> AsterixDB.
> >>>>> >>>>>> *Feature:* aggregate function to infer schema
> >>>>> >>>>>> *Details:* This feature introduces schema inference as an SQL++ function
> >>>>> >>>>>> directly integrated into AsterixDB. It is the first approach to offer
> >>>>> >>>>>> schema inference as a native SQL++ function, allowing users to infer
> >>>>> >>>>>> schemas not only for any dataset but also for queries and subqueries. Its
> >>>>> >>>>>> output in JSON Schema, the industry standard, produces both human- and
> >>>>> >>>>>> machine-readable results, suitable for user interpretation or integration
> >>>>> >>>>>> into other queries or programs.
> >>>>> >>>>>>
> >>>>> >>>>>> Utilizing the template of array_avg() in the Built-in Function and Function
> >>>>> >>>>>> collection file, the array_schema() was implemented. During self review, I
> >>>>> >>>>>> found that a lot of defined aggregate functions, for example
> >>>>> >>>>>> SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction,
> >>>>> >>>>>> are not being utilised during the array_schema() query. Is it due to
> >>>>> >>>>>> different use cases or am I utilising it incorrectly?
> >>>>> >>>>>>
> >>>>> >>>>>> Are there any resources to understand the functionality of aggregate
> >>>>> >>>>>> functions in the implementation?
> >>>>> >>>>>>
> >>>>> >>>>>> *APE*
> >>>>> >>>>>>
> >>>>> >>>>>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*8*3A*Schema*Inference*Aggregate*Functions__;KyUrKysr!!MLMg-p0Z!HkhB6GEhDbMZCMLBJ-D-xVzlEWogaW_K1Q-PA5k7uWP9j65SjJ5GrpCfjkcY5JrwLFniY7bed7dan7Lt$

--
Shiva Jahangiri
Assistant Professor in Computer Science and Engineering Department
Santa Clara University