+1

On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]> wrote:
> Hi all,
>
> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>
> *Motivation*
> Top-K nearest neighbor search is a fundamental building block for semantic
> search, retrieval-augmented generation (RAG), recommendation systems, and
> geospatial nearest-neighbor queries. Today, Spark SQL users have to express
> this pattern through verbose CROSS JOIN + window function or max_by/min_by
> workarounds - patterns that materialize the full Cartesian product and give
> the optimizer no semantic signal for specialized execution strategies.
> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
> pgvector) all provide dedicated primitives for this. Spark currently does
> not.
>
> *Proposal*
> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
> BY clause for top-K ranking joins. The BY expression is pluggable - vector
> similarity, geometric distance, BM25, or any composite scoring expression -
> making the same syntax usable across vector search, geospatial, and text
> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
> contract explicit, ensuring future index creation or deletion never
> silently changes query results.
>
> The initial scope covers SQL syntax, brute-force exact execution
> (rewritten into existing physical operators: JOIN + max_by/min_by with K
> overload), and Spark Connect / PySpark API support. Vector index DDL and
> indexed ANN execution are deferred as future work.
>
> *Example SQL*:
>
> sql
> -- Batch vector search: find the 10 most similar products for each user
> SELECT q.user_id, t.*
> FROM users q
> INNER JOIN products t
> APPROX NEAREST 10 BY SIMILARITY
>   vector_cosine_similarity(q.embedding, t.embedding)
>
> *Relevant Links*
>
> SPIP Document:
> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
> Discussion Thread:
> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>
> The vote will be open for at least 72 hours.
> Please vote:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Cheers,
>
> Zhidong (Zero) Qu
> Software Engineer
> AI System
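
For anyone skimming the thread, here is a rough sketch of what the brute-force exact semantics described in the proposal amount to: for each left row, score every right row and keep the K best. It is plain Python, not Spark code and not the SPIP's actual rewrite into JOIN + max_by/min_by. The names `exact_nearest_join` and `cosine_similarity`, the sample rows, and the zero-vector handling are all assumptions made for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch only: a zero vector scores 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


def exact_nearest_join(left_rows, right_rows, k, score):
    """Hypothetical helper: brute-force top-K ranking join.

    For every left row, score all right rows and keep the k highest-scoring
    ones. This evaluates the score for every pair, so it is only a statement
    of the semantics, not an efficient execution strategy.
    """
    result = []
    for q in left_rows:
        top_k = heapq.nlargest(k, right_rows, key=lambda t: score(q, t))
        result.extend((q, t) for t in top_k)
    return result


# Made-up sample data, loosely mirroring the users/products example.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [0.9, 0.1]},
    {"product_id": "b", "embedding": [0.1, 0.9]},
    {"product_id": "c", "embedding": [0.7, 0.7]},
]

pairs = exact_nearest_join(
    users,
    products,
    2,
    lambda q, t: cosine_similarity(q["embedding"], t["embedding"]),
)
matches = [(q["user_id"], t["product_id"]) for q, t in pairs]
print(matches)
```

With this sample data, each user is paired with its two most similar products, best match first.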
