+1

On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:

> +1
>
> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>
>> *Motivation*
>> Top-K nearest neighbor search is a fundamental building block for
>> semantic search, retrieval-augmented generation (RAG), recommendation
>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>> have to express this pattern through verbose CROSS JOIN + window function
>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>> product and give the optimizer no semantic signal for specialized execution
>> strategies.
>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
>> pgvector) all provide dedicated primitives for this. Spark currently does
>> not.
>>
>> *Proposal*
>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
>> BY clause for top-K ranking joins. The BY expression is pluggable - vector
>> similarity, geometric distance, BM25, or any composite scoring expression -
>> making the same syntax usable across vector search, geospatial, and text
>> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
>> contract explicit, ensuring future index creation or deletion never
>> silently changes query results.
>>
>> The initial scope covers SQL syntax, brute-force exact execution
>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>> indexed ANN execution are deferred as future work.
>>
>> *Example SQL*:
>>
>> -- Batch vector search: find the 10 most similar products for each user
>> SELECT q.user_id, t.*
>> FROM users q
>> INNER JOIN products t
>>   APPROX NEAREST 10 BY SIMILARITY
>>     vector_cosine_similarity(q.embedding, t.embedding)
>>
>> *Relevant Links*
>>
>> SPIP Document:
>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>> Discussion Thread:
>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>
>> The vote will be open for at least 72 hours.
>> Please vote:
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Cheers,
>>
>> Zhidong (Zero) Qu
>> Software Engineer
>> AI System
>>
>>
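
For readers skimming the thread, the brute-force exact semantics named in the proposal's initial scope can be illustrated outside Spark. What follows is a minimal pure-Python sketch, not the SPIP's implementation: the function names `cosine_similarity` and `nearest_by_similarity` and the sample data are hypothetical, and the code only shows what "top K per left-side row, ranked by a pluggable score" means.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_by_similarity(queries, targets, k, score=cosine_similarity):
    """For each query id, return the ids of the k highest-scoring targets.

    queries and targets map an id to an embedding (a list of floats).
    Every query is scored against every target, i.e. brute-force exact search.
    """
    result = {}
    for qid, qvec in queries.items():
        # Score the full set of targets for this query row.
        scored = ((score(qvec, tvec), tid) for tid, tvec in targets.items())
        # Keep only the k best (score, id) pairs, highest score first.
        top = heapq.nlargest(k, scored)
        result[qid] = [tid for _, tid in top]
    return result


# Hypothetical sample data standing in for the users and products tables.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}

top2 = nearest_by_similarity(users, products, k=2)
```

Passing a different `score` callable (a negated distance, for example) mirrors the pluggable BY expression described in the proposal. Every query is still compared against every target, which is the cost an indexed APPROX strategy would aim to avoid.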
