+1 (non-binding)

It would also be interesting to see how we could add streaming support for this operator in the future.
Thanks,
Anish

On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <[email protected]> wrote:

> +1 (non-binding)
>
> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>
> +1
>
> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>>
>>>>>> *Motivation*
>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>>>> or max_by/min_by workarounds - patterns that materialize the full
>>>>>> Cartesian product and give the optimizer no semantic signal for
>>>>>> specialized execution strategies.
>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>> with pgvector) all provide dedicated primitives for this. Spark currently
>>>>>> does not.
>>>>>>
>>>>>> *Proposal*
>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>>>> expression - making the same syntax usable across vector search,
>>>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords
>>>>>> make the search algorithm contract explicit, ensuring future index
>>>>>> creation or deletion never silently changes query results.
>>>>>>
>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>>>> indexed ANN execution are deferred as future work.
>>>>>>
>>>>>> *Example SQL*:
>>>>>>
>>>>>>     -- Batch vector search: find the 10 most similar products for each user
>>>>>>     SELECT q.user_id, t.*
>>>>>>     FROM users q
>>>>>>     INNER JOIN products t
>>>>>>       APPROX NEAREST 10 BY SIMILARITY
>>>>>>       vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>
>>>>>> *Relevant Links*
>>>>>>
>>>>>> SPIP Document:
>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>> Discussion Thread:
>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>
>>>>>> The vote will be open for at least 72 hours.
>>>>>> Please vote:
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Zhidong (Zero) Qu
>>>>>> Software Engineer
>>>>>> AI System
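
For readers skimming the thread, the brute-force exact semantics described in the proposal (score every pair, then keep the K best matches per left-side row) can be illustrated with a small plain-Python sketch. This is only a toy model of the semantics, not Spark's planned implementation: `nearest_by_similarity` and `cosine_similarity` are hypothetical helpers written for this note, and the `users` / `products` data is made up.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_by_similarity(left, right, k, score):
    """For every left row, return the keys of the k right rows with the highest score.

    Brute force: every (left, right) pair is scored, which mirrors a
    CROSS JOIN followed by a per-left-row top-k. Hypothetical helper,
    not a Spark API.
    """
    result = {}
    for left_key, left_vec in left.items():
        scored = ((score(left_vec, right_vec), right_key)
                  for right_key, right_vec in right.items())
        # nlargest keeps only k candidates in memory and returns them best-first.
        result[left_key] = [key for _, key in heapq.nlargest(k, scored)]
    return result


# Made-up toy data: two users and three products with 2-d embeddings.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}

top2 = nearest_by_similarity(users, products, k=2, score=cosine_similarity)
print(top2)
```

The work here grows with the product of the two input sizes, which is the cost the motivation section attributes to the CROSS JOIN workarounds; an indexed APPROX path would presumably trade exactness to avoid scoring every pair.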
