+1 (non-binding)

Best,
Yingyi

On Fri, May 1, 2026 at 11:33 AM Anish Shrigondekar via dev <[email protected]> wrote:

> +1 (non-binding)
>
> Would also be interesting to see how we could add streaming support for
> this operator in the future as well
>
> Thanks,
> Anish
>
> On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>>
>> +1
>>
>> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>>>
>>>>>>> *Motivation*
>>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL
>>>>>>> users have to express this pattern through verbose CROSS JOIN + window
>>>>>>> function or max_by/min_by workarounds - patterns that materialize the
>>>>>>> full Cartesian product and give the optimizer no semantic signal for
>>>>>>> specialized execution strategies.
>>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>>> with pgvector) all provide dedicated primitives for this. Spark
>>>>>>> currently does not.
>>>>>>>
>>>>>>> *Proposal*
>>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a
>>>>>>> NEAREST ... BY clause for top-K ranking joins. The BY expression is
>>>>>>> pluggable - vector similarity, geometric distance, BM25, or any
>>>>>>> composite scoring expression - making the same syntax usable across
>>>>>>> vector search, geospatial, and text retrieval use cases. The
>>>>>>> APPROX / EXACT keywords make the search algorithm contract explicit,
>>>>>>> ensuring future index creation or deletion never silently changes
>>>>>>> query results.
>>>>>>>
>>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by
>>>>>>> with K overload), and Spark Connect / PySpark API support. Vector
>>>>>>> index DDL and indexed ANN execution are deferred as future work.
>>>>>>>
>>>>>>> *Example SQL*:
>>>>>>>
>>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>>> SELECT q.user_id, t.*
>>>>>>> FROM users q
>>>>>>> INNER JOIN products t
>>>>>>>   APPROX NEAREST 10 BY SIMILARITY
>>>>>>>   vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>>
>>>>>>> *Relevant Links*
>>>>>>>
>>>>>>> SPIP Document:
>>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>>> Discussion Thread:
>>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>>
>>>>>>> The vote will be open for at least 72 hours.
>>>>>>> Please vote:
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Zhidong (Zero) Qu
>>>>>>> Software Engineer
>>>>>>> AI System
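
For readers skimming the thread, the brute-force exact semantics described in the quoted proposal (score every pair, keep the K best targets per query row) can be sketched roughly as follows. This is plain Python for illustration only, not Spark code and not the SPIP's implementation; the function names `cosine_similarity` and `exact_nearest_by_similarity` and the toy data are hypothetical, and vectors are assumed to be non-zero.

```python
import heapq
import math


def cosine_similarity(a, b):
    # Assumes both vectors are non-zero and of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_nearest_by_similarity(queries, targets, k):
    """For each query row, return the ids of the k highest-scoring targets.

    Hypothetical helper: scores every (query, target) pair, which is the
    brute-force behaviour, but keeps only k candidates per query at a time.
    """
    result = {}
    for q_id, q_vec in queries:
        scored = (
            (cosine_similarity(q_vec, t_vec), t_id) for t_id, t_vec in targets
        )
        # nlargest retains at most k items while scanning the scores.
        top_k = heapq.nlargest(k, scored)
        result[q_id] = [t_id for _, t_id in top_k]
    return result


# Toy data standing in for the users and products tables.
users = [("u1", [1.0, 0.0]), ("u2", [0.0, 1.0])]
products = [("p1", [1.0, 0.1]), ("p2", [0.1, 1.0]), ("p3", [1.0, 1.0])]

top2 = exact_nearest_by_similarity(users, products, k=2)
print(top2)
```

The sketch still evaluates the score for every pair, so it has the same quadratic cost as the workarounds mentioned in the motivation; it only shows what result an exact top-K ranking join is expected to return.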
