Hi all, I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
*Motivation*

Top-K nearest neighbor search is a fundamental building block for semantic search, retrieval-augmented generation (RAG), recommendation systems, and geospatial nearest-neighbor queries. Today, Spark SQL users have to express this pattern through verbose CROSS JOIN + window function or max_by/min_by workarounds. These patterns materialize the full Cartesian product and give the optimizer no semantic signal for specialized execution strategies. Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with pgvector) all provide dedicated primitives for this. Spark currently does not.

*Proposal*

This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ... BY clause for top-K ranking joins. The BY expression is pluggable (vector similarity, geometric distance, BM25, or any composite scoring expression), making the same syntax usable across vector search, geospatial, and text retrieval use cases. The APPROX / EXACT keywords make the search algorithm contract explicit, ensuring future index creation or deletion never silently changes query results.

The initial scope covers SQL syntax, brute-force exact execution (rewritten into existing physical operators: JOIN + max_by/min_by with K overload), and Spark Connect / PySpark API support. Vector index DDL and indexed ANN execution are deferred as future work.

*Example SQL*

    -- Batch vector search: find the 10 most similar products for each user
    SELECT q.user_id, t.*
    FROM users q
    INNER JOIN products t
      APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)

*Relevant Links*

SPIP Document: https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
Discussion Thread: https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
JIRA: https://issues.apache.org/jira/browse/SPARK-56395

The vote will be open for at least 72 hours.
Please vote:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Cheers,
Zhidong (Zero) Qu
Software Engineer AI System
