+1
On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:

> +1
>
> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>
>> +1
>>
>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>
>>> *Motivation*
>>> Top-K nearest neighbor search is a fundamental building block for
>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>> have to express this pattern through verbose CROSS JOIN + window function
>>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>>> product and give the optimizer no semantic signal for specialized execution
>>> strategies.
>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
>>> pgvector) all provide dedicated primitives for this. Spark currently does
>>> not.
>>>
>>> *Proposal*
>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
>>> BY clause for top-K ranking joins. The BY expression is pluggable - vector
>>> similarity, geometric distance, BM25, or any composite scoring expression -
>>> making the same syntax usable across vector search, geospatial, and text
>>> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
>>> contract explicit, ensuring future index creation or deletion never
>>> silently changes query results.
>>>
>>> The initial scope covers SQL syntax, brute-force exact execution
>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>> indexed ANN execution are deferred as future work.
>>>
>>> *Example SQL*:
>>>
>>> -- Batch vector search: find the 10 most similar products for each user
>>> SELECT q.user_id, t.*
>>> FROM users q
>>> INNER JOIN products t
>>> APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
>>> t.embedding)
>>>
>>> *Relevant Links*
>>>
>>> SPIP Document:
>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>> Discussion Thread:
>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>
>>> The vote will be open for at least 72 hours.
>>> Please vote:
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Cheers,
>>>
>>> Zhidong (Zero) Qu
>>> Software Engineer
>>> AI System
