+1 (non-binding)

On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <[email protected]> wrote:
> +1
>
> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:
>
>> +1
>>
>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>
>>>> *Motivation*
>>>> Top-K nearest neighbor search is a fundamental building block for
>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>>>> product and give the optimizer no semantic signal for specialized execution
>>>> strategies.
>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>> with pgvector) all provide dedicated primitives for this. Spark currently
>>>> does not.
>>>>
>>>> *Proposal*
>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>> expression - making the same syntax usable across vector search,
>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords make
>>>> the search algorithm contract explicit, ensuring future index creation or
>>>> deletion never silently changes query results.
>>>>
>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>> indexed ANN execution are deferred as future work.
>>>>
>>>> *Example SQL*:
>>>>
>>>> -- Batch vector search: find the 10 most similar products for each user
>>>> SELECT q.user_id, t.*
>>>> FROM users q
>>>> INNER JOIN products t
>>>> APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)
>>>>
>>>> *Relevant Links*
>>>>
>>>> SPIP Document:
>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>> Discussion Thread:
>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>
>>>> The vote will be open for at least 72 hours.
>>>> Please vote:
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>> Cheers,
>>>>
>>>> Zhidong (Zero) Qu
>>>> Software Engineer
>>>> AI System
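
For anyone skimming the thread, here is a rough sketch of what the brute-force exact path in the quoted proposal amounts to: for each left-side row, score every right-side row and keep the K best. This is plain Python for illustration only, not Spark code and not the SPIP's implementation. The names exact_nearest_by and cosine_similarity, the sample rows, and the choice to return 0.0 for zero-length vectors are all assumptions of this sketch.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: a zero vector scores 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


def exact_nearest_by(left_rows, right_rows, k, score):
    """Hypothetical helper: brute-force top-K ranking join.

    For every left row, score all right rows and keep the k highest.
    Left rows with no candidates produce no output, as in an inner join.
    """
    pairs = []
    for left in left_rows:
        # nlargest returns the k best candidates, highest score first.
        best = heapq.nlargest(k, right_rows, key=lambda right: score(left, right))
        pairs.extend((left, right) for right in best)
    return pairs


# Made-up sample data standing in for the users/products tables.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [0.9, 0.1]},
    {"product_id": "b", "embedding": [0.1, 0.9]},
    {"product_id": "c", "embedding": [0.7, 0.7]},
]

pairs = exact_nearest_by(
    users,
    products,
    k=2,
    score=lambda u, p: cosine_similarity(u["embedding"], p["embedding"]),
)
matches = [(u["user_id"], p["product_id"]) for u, p in pairs]
print(matches)
```

The sketch scores every (user, product) pair, which is the full Cartesian product cost the motivation section mentions; it only shows the result contract of the exact path, not how an engine would execute it efficiently.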
