+1

On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
> +1
>
> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>
>> *Motivation*
>> Top-K nearest neighbor search is a fundamental building block for
>> semantic search, retrieval-augmented generation (RAG), recommendation
>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>> have to express this pattern through verbose CROSS JOIN + window function
>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>> product and give the optimizer no semantic signal for specialized execution
>> strategies.
>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
>> pgvector) all provide dedicated primitives for this. Spark currently does
>> not.
>>
>> *Proposal*
>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
>> BY clause for top-K ranking joins. The BY expression is pluggable - vector
>> similarity, geometric distance, BM25, or any composite scoring expression -
>> making the same syntax usable across vector search, geospatial, and text
>> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
>> contract explicit, ensuring future index creation or deletion never
>> silently changes query results.
>>
>> The initial scope covers SQL syntax, brute-force exact execution
>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>> indexed ANN execution are deferred as future work.
>>
>> *Example SQL*:
>>
>> -- Batch vector search: find the 10 most similar products for each user
>> SELECT q.user_id, t.*
>> FROM users q
>> INNER JOIN products t
>>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
>>   t.embedding)
>>
>> *Relevant Links*
>>
>> SPIP Document:
>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>> Discussion Thread:
>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>
>> The vote will be open for at least 72 hours.
>> Please vote:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Cheers,
>>
>> Zhidong (Zero) Qu
>> Software Engineer
>> AI System
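
For readers following the thread, the brute-force exact semantics described in the proposal (score every left/right pair, keep the top K per left row) can be sketched in plain Python. This is an illustration only, not the SPIP's implementation or any Spark API: the function names `cosine_similarity` and `nearest_join`, the sample rows, and the choice to score zero vectors as 0.0 are all assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_join(left, right, k, score):
    """For each left row, keep the k right rows with the highest score.

    Brute force: every (left, right) pair is scored, so the cost is
    proportional to len(left) * len(right).
    """
    result = []
    for l_row in left:
        top = heapq.nlargest(k, right, key=lambda r_row: score(l_row, r_row))
        result.extend((l_row, r_row) for r_row in top)
    return result


# Hypothetical sample data, for illustration only.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [1.0, 0.1]},
    {"product_id": "b", "embedding": [0.1, 1.0]},
    {"product_id": "c", "embedding": [0.7, 0.7]},
]

pairs = nearest_join(
    users,
    products,
    k=2,
    score=lambda q, t: cosine_similarity(q["embedding"], t["embedding"]),
)
for q, t in pairs:
    print(q["user_id"], t["product_id"])
# prints:
# 1 a
# 1 c
# 2 b
# 2 c
```

If K exceeds the number of right-side rows, this sketch simply returns every pair; how the proposed syntax treats that case is defined by the SPIP document, not by this example.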
