+1

On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:

> +1 (non-binding)
>
> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
> [email protected]> wrote:
>
>> +1
>>
>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>
>>>>> *Motivation*
>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL
>>>>> users have to express this pattern through verbose CROSS JOIN +
>>>>> window function or max_by/min_by workarounds - patterns that
>>>>> materialize the full Cartesian product and give the optimizer no
>>>>> semantic signal for specialized execution strategies.
>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>> with pgvector) all provide dedicated primitives for this. Spark
>>>>> currently does not.
>>>>>
>>>>> *Proposal*
>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>>> ... BY clause for top-K ranking joins. The BY expression is
>>>>> pluggable - vector similarity, geometric distance, BM25, or any
>>>>> composite scoring expression - making the same syntax usable across
>>>>> vector search, geospatial, and text retrieval use cases. The APPROX /
>>>>> EXACT keywords make the search algorithm contract explicit, ensuring
>>>>> future index creation or deletion never silently changes query
>>>>> results.
>>>>>
>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by
>>>>> with K overload), and Spark Connect / PySpark API support. Vector
>>>>> index DDL and indexed ANN execution are deferred as future work.
>>>>>
>>>>> *Example SQL*:
>>>>>
>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>> SELECT q.user_id, t.*
>>>>> FROM users q
>>>>> INNER JOIN products t
>>>>>   APPROX NEAREST 10 BY SIMILARITY
>>>>>   vector_cosine_similarity(q.embedding, t.embedding)
>>>>>
>>>>> *Relevant Links*
>>>>>
>>>>> SPIP Document:
>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>> Discussion Thread:
>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>
>>>>> The vote will be open for at least 72 hours.
>>>>> Please vote:
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Zhidong (Zero) Qu
>>>>> Software Engineer
>>>>> AI System
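
For readers skimming the thread, the brute-force exact semantics described in the proposal (score every left/right pair, then keep the K best matches per left-side row) can be illustrated outside Spark with a minimal Python sketch. This is only an illustration under assumptions, not the SPIP's implementation or any Spark API: the function names `cosine_similarity` and `nearest_by_similarity`, the sample `users` / `products` data, and the tie-breaking on the target id are all hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_by_similarity(queries, targets, k, score=cosine_similarity):
    """Hypothetical helper: exact top-K per query row by scoring every pair.

    queries / targets map an id to a vector. Every (query, target) pair is
    scored, which is the full Cartesian product the thread mentions; only the
    K highest-scoring target ids are kept per query, best first.
    """
    result = {}
    for qid, qvec in queries.items():
        scored = ((score(qvec, tvec), tid) for tid, tvec in targets.items())
        # nlargest keeps at most k items in memory; ties fall back to the id.
        top = heapq.nlargest(k, scored)
        result[qid] = [tid for _, tid in top]
    return result


# Made-up sample data, for illustration only.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}

matches = nearest_by_similarity(users, products, k=2)
print(matches)
```

The `score` parameter is passed in as a function to mirror the pluggable BY expression in the proposal; swapping in a distance function would also require keeping the smallest values (for example with `heapq.nsmallest`) instead of the largest.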
