+1 (non-binding)

On Fri, May 1, 2026 at 21:29, Yingyi Bu <[email protected]> wrote:
> +1 (non-binding)
>
> Best,
> Yingyi
>
> On Fri, May 1, 2026 at 11:33 AM Anish Shrigondekar via dev <
> [email protected]> wrote:
>
>> +1 (non-binding)
>>
>> Would also be interesting to see how we could add streaming support for
>> this operator in the future as well
>>
>> Thanks,
>> Anish
>>
>> On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <
>> [email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>>>
>>> +1
>>>
>>> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>>>>
>>>>>>>> *Motivation*
>>>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL
>>>>>>>> users have to express this pattern through verbose CROSS JOIN +
>>>>>>>> window function or max_by/min_by workarounds - patterns that
>>>>>>>> materialize the full Cartesian product and give the optimizer no
>>>>>>>> semantic signal for specialized execution strategies.
>>>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>>>> with pgvector) all provide dedicated primitives for this. Spark
>>>>>>>> currently does not.
>>>>>>>>
>>>>>>>> *Proposal*
>>>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a
>>>>>>>> NEAREST ... BY clause for top-K ranking joins. The BY expression is
>>>>>>>> pluggable - vector similarity, geometric distance, BM25, or any
>>>>>>>> composite scoring expression - making the same syntax usable across
>>>>>>>> vector search, geospatial, and text retrieval use cases. The
>>>>>>>> APPROX / EXACT keywords make the search algorithm contract explicit,
>>>>>>>> ensuring future index creation or deletion never silently changes
>>>>>>>> query results.
>>>>>>>>
>>>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by
>>>>>>>> with K overload), and Spark Connect / PySpark API support. Vector
>>>>>>>> index DDL and indexed ANN execution are deferred as future work.
>>>>>>>>
>>>>>>>> *Example SQL*:
>>>>>>>>
>>>>>>>> sql
>>>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>>>> SELECT q.user_id, t.*
>>>>>>>> FROM users q
>>>>>>>> INNER JOIN products t
>>>>>>>> APPROX NEAREST 10 BY SIMILARITY
>>>>>>>> vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>>>
>>>>>>>> *Relevant Links*
>>>>>>>>
>>>>>>>> SPIP Document:
>>>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>>>> Discussion Thread:
>>>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>>>
>>>>>>>> The vote will be open for at least 72 hours.
>>>>>>>> Please vote:
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> [ ] +0
>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Zhidong (Zero) Qu
>>>>>>>> Software Engineer
>>>>>>>> AI System
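
For readers skimming the thread, here is a rough sketch of what brute-force exact top-K semantics look like for the quoted example: for each user row, score every product row and keep the K best. This is plain Python for illustration only, not Spark code and not the rewrite described in the SPIP. The function names (`cosine_similarity`, `nearest_by_similarity`), the `(id, embedding)` row layout, and the choice to score zero vectors as 0 are all assumptions made for this sketch.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption of this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def nearest_by_similarity(queries, targets, k, score=cosine_similarity):
    """For each query row, return the k target rows with the highest score.

    queries, targets: lists of (id, embedding) tuples (hypothetical layout).
    Returns {query_id: [(target_id, score), ...]}, best match first.
    """
    result = {}
    for q_id, q_vec in queries:
        # Lazily score every target against this query (brute force).
        scored = ((t_id, score(q_vec, t_vec)) for t_id, t_vec in targets)
        # heapq.nlargest keeps at most k candidates in memory per query,
        # so the full scored cross product is never stored or sorted.
        result[q_id] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return result


# Toy data standing in for the users/products tables in the example SQL.
users = [("u1", [1.0, 0.0]), ("u2", [0.0, 1.0])]
products = [("p1", [1.0, 0.1]), ("p2", [0.1, 1.0]), ("p3", [1.0, 1.0])]
top2 = nearest_by_similarity(users, products, k=2)
```

With this toy data, `u1` is matched to `p1` then `p3`, and `u2` to `p2` then `p3`. The `score` parameter is swappable, which loosely mirrors the pluggable BY expression in the proposal; a distance-based variant would use `heapq.nsmallest` instead.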
