Can we put this to a formal vote, as was done in https://lists.apache.org/thread/2k3mk471rflrnwwq64dtjhy8ydblwb92 ? This APE follows a similar pattern: it went through some discussion and revision. In those cases, I think calling for a vote is better than using the expedited process.
On Mon, Jul 28, 2025 at 3:29 PM Shiva Jahangiri <sjahang...@scu.edu.invalid> wrote:
>
> If my vote counts, then +1 from me to push this change forward.
>
> On Sun, Jul 27, 2025 at 10:46 AM Mike Carey <dtab...@gmail.com> wrote:
>
> > +1
> >
> > (in case I didn't already chime in with that)
> >
> > On 7/25/25 9:01 AM, Calvin Dani wrote:
> > > Hi,
> > >
> > > I’ve made the changes to the user model and added an example dataset and query for the schema inference functions.
> > > Open to any further feedback and I’d really appreciate your vote if you think it looks good!
> > >
> > > Thank you and Regards
> > > Calvin Dani
> > >
> > > On Thu, Feb 6, 2025 at 10:40 AM Mike Carey <dtab...@gmail.com> wrote:
> > >
> > >> I have put comments on the wiki - some thoughts about the user model, etc.
> > >>
> > >> On 2/4/25 7:46 AM, Calvin Dani wrote:
> > >>> Hi,
> > >>>
> > >>> The APE has been updated following the implementation of Query 3, current_schema(), which fetches and aggregates the most recent schema from the LSM components.
> > >>>
> > >>> The updates include:
> > >>>
> > >>> Syntax of the new query
> > >>>
> > >>> A flowchart illustrating how the query works
> > >>>
> > >>> I’d love to hear your thoughts and suggestions! If you find it promising, I’d appreciate your vote.
> > >>> Thank you and Regards
> > >>> Calvin Dani
> > >>>
> > >>> On Wed, Dec 18, 2024 at 11:41 AM Mike Carey <dtab...@gmail.com> wrote:
> > >>>
> > >>>> Nice!
> > >>>>
> > >>>> On 12/13/24 4:32 PM, Calvin Dani wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> Regarding the performance testing of the first query for schema inference:
> > >>>>> We benchmarked it against contemporary methods, primarily Spark-based implementations, using a configuration of 2 node controllers and 8 data partitions.
> > >>>>>
> > >>>>> For a GitHub dataset of 51GB:
> > >>>>>
> > >>>>> Our approach inferred the schema in 51.6 seconds,
> > >>>>>
> > >>>>> Spark’s native implementation took 81.6 seconds,
> > >>>>>
> > >>>>> Methods by Spoth and Mior required 400+ seconds.
> > >>>>>
> > >>>>> I hope this is helpful.
> > >>>>>
> > >>>>> Regards
> > >>>>> Calvin Dani
> > >>>>>
> > >>>>> On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:
> > >>>>>
> > >>>>>     Question - I think you were doing some perf testing - do you have perf results for these (vs. the current schema function)?
> > >>>>>
> > >>>>>     On 12/5/24 12:04 PM, Calvin Dani wrote:
> > >>>>> > Hi,
> > >>>>> >
> > >>>>> > Wanted to share an update regarding the features in the APE. The two queries:
> > >>>>> >
> > >>>>> > 1. query_schema()
> > >>>>> >
> > >>>>> > 2. collection_schema()
> > >>>>> >
> > >>>>> > are now functional. The query_schema() implementation has been submitted for review. Once that is approved, I will proceed to submit the collection_schema() query, as it depends on the first query's code.
> > >>>>> >
> > >>>>> > I would greatly appreciate your feedback, additional test cases, and any thoughts you have on this APE. I’m eager to refine it further or, if it seems like a solid starting point, to receive approval for this APE.
> > >>>>> >
> > >>>>> > Thank you for your time and input!
> > >>>>> >
> > >>>>> > Regards
> > >>>>> >
> > >>>>> > Calvin Dani
> > >>>>> >
> > >>>>> > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
> > >>>>> >
> > >>>>> >> Hi,
> > >>>>> >>
> > >>>>> >> The APE has been updated with those changes!
> > >>>>> >>
> > >>>>> >> Regards
> > >>>>> >> Calvin Dani
> > >>>>> >>
> > >>>>> >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey <dtab...@gmail.com> wrote:
> > >>>>> >>
> > >>>>> >>> Excellent! +1
> > >>>>> >>>
> > >>>>> >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
> > >>>>> >>>
> > >>>>> >>>> Hi,
> > >>>>> >>>>
> > >>>>> >>>> Thank you for the feedback. As per the last meeting, here are the changes that have been incorporated into this APE.
> > >>>>> >>>> They are as follows:
> > >>>>> >>>> 1. Name of the schema inference functions
> > >>>>> >>>> 2. Schema inference functionality
> > >>>>> >>>>
> > >>>>> >>>> The summary of changes is as follows:
> > >>>>> >>>>
> > >>>>> >>>> 1. query_schema (Aggregate function that takes all records of the subquery and generates a JSON Schema),
> > >>>>> >>>> 2. collection_schema (JSON Schema translation of the defined datatypes in the metadata node)
> > >>>>> >>>> 3. current_schema (for columnar stores and converting the inferred schema for storage compaction to JSON Schema)
> > >>>>> >>>>
> > >>>>> >>>> Regards
> > >>>>> >>>> Calvin Dani
> > >>>>> >>>>
> > >>>>> >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey <dtab...@gmail.com> wrote:
> > >>>>> >>>>
> > >>>>> >>>>> Great feature! I wasn't able to understand the query example(s), though... Could those be cleaned up a little and clarified?
> > >>>>> >>>>>
> > >>>>> >>>>> Also, I think we might want two functions at the user level - one that takes an expression as input and reports its schema, and another that takes a dataset/collection name as input and reports its schema. The first one would scan the results and say what the schema is; the other would use a more efficient approach (accessing and combining the metadata from the collection's most recent LSM components in each of its partitions).
> > >>>>> >>>>>
> > >>>>> >>>>> Cheers,
> > >>>>> >>>>>
> > >>>>> >>>>> Mike
> > >>>>> >>>>>
> > >>>>> >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> > >>>>> >>>>>> Initiating the discussion thread proposing a new aggregate function in AsterixDB.
> > >>>>> >>>>>> *Feature:* aggregate function to infer schema
> > >>>>> >>>>>> *Details:* This feature introduces schema inference as an SQL++ function directly integrated into AsterixDB. It is the first approach to offer schema inference as a native SQL++ function, allowing users to infer schemas not only for any dataset but also for queries and subqueries. Its output in JSON Schema, the industry standard, produces both human- and machine-readable results, suitable for user interpretation or integration into other queries or programs.
> > >>>>> >>>>>>
> > >>>>> >>>>>> Utilizing the template of array_avg() in the Built-in Function and Function collection file, the array_schema() was implemented. During self-review, a lot of defined aggregate functions, for example SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction, are not being utilised during the array_schema() query. Is it due to different use cases or am I utilising it incorrectly?
> > >>>>> >>>>>>
> > >>>>> >>>>>> Are there any resources to understand the functionality of aggregate functions in the implementation?
> > >>>>> >>>>>>
> > >>>>> >>>>>> *APE*
> > >>>>> >>>>>>
> > >>>>> >>>>>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*8*3A*Schema*Inference*Aggregate*Functions__;KyUrKysr!!MLMg-p0Z!HkhB6GEhDbMZCMLBJ-D-xVzlEWogaW_K1Q-PA5k7uWP9j65SjJ5GrpCfjkcY5JrwLFniY7bed7dan7Lt$
> >
>
> --
> Shiva Jahangiri
> Assistant Professor in Computer Science and Engineering Department
> Santa Clara University
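
For anyone skimming the thread, the idea behind query_schema() as described above (an aggregate that takes all records of a subquery and produces one JSON Schema) can be sketched roughly as follows. This is only an illustrative Python sketch, not the AsterixDB implementation and not SQL++; the helper names (type_name, infer, merge, infer_schema) and the merging rules (union of observed types, "required" as the set of fields present in every record) are assumptions made for the example.

```python
import json
from typing import Any, Iterable, Optional

Schema = dict  # a JSON Schema fragment held as a plain dict


def type_name(value: Any) -> str:
    """Map a Python value to a JSON Schema primitive type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def _types(schema: Schema) -> set:
    t = schema["type"]
    return set(t) if isinstance(t, list) else {t}


def merge(a: Optional[Schema], b: Optional[Schema]) -> Optional[Schema]:
    """Combine two schemas into one that accepts values described by either."""
    if a is None:
        return b
    if b is None:
        return a
    types = sorted(_types(a) | _types(b))
    out: Schema = {"type": types[0] if len(types) == 1 else types}
    if "properties" in a or "properties" in b:
        pa, pb = a.get("properties", {}), b.get("properties", {})
        out["properties"] = {
            key: merge(pa.get(key), pb.get(key)) for key in sorted(pa.keys() | pb.keys())
        }
        if "properties" in a and "properties" in b:
            # A field stays required only if every object seen so far had it.
            out["required"] = sorted(set(a["required"]) & set(b["required"]))
        else:
            out["required"] = (a if "properties" in a else b)["required"]
    items = merge(a.get("items"), b.get("items"))
    if items is not None:
        out["items"] = items
    return out


def infer(value: Any) -> Schema:
    """Infer a schema for a single value."""
    kind = type_name(value)
    if kind == "object":
        return {
            "type": "object",
            "properties": {k: infer(v) for k, v in value.items()},
            "required": sorted(value),
        }
    if kind == "array":
        schema: Schema = {"type": "array"}
        items = None
        for element in value:
            items = merge(items, infer(element))
        if items is not None:
            schema["items"] = items
        return schema
    return {"type": kind}


def infer_schema(records: Iterable[Any]) -> Optional[Schema]:
    """Aggregate step: fold the per-record schemas into one schema."""
    schema = None
    for record in records:
        schema = merge(schema, infer(record))
    return schema


if __name__ == "__main__":
    sample = [{"id": 1, "name": "a"}, {"id": 2, "tags": ["x"]}]
    print(json.dumps(infer_schema(sample), indent=2))
```

Because merge() only ever combines two schemas at a time, the same function could in principle serve both as the per-record step and as the step that combines partial results from different partitions, which is the usual shape of a distributed aggregate.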