+1

On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:

> +1
>
> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>
>> +1
>>
>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>
>>> *Motivation*
>>> Top-K nearest neighbor search is a fundamental building block for
>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>> have to express this pattern through verbose CROSS JOIN + window function
>>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>>> product and give the optimizer no semantic signal for specialized execution
>>> strategies.
>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
>>> pgvector) all provide dedicated primitives for this. Spark currently does
>>> not.
>>>
>>> *Proposal*
>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ...
>>> BY clause for top-K ranking joins. The BY expression is pluggable - vector
>>> similarity, geometric distance, BM25, or any composite scoring expression -
>>> making the same syntax usable across vector search, geospatial, and text
>>> retrieval use cases. The APPROX / EXACT keywords make the search algorithm
>>> contract explicit, ensuring future index creation or deletion never
>>> silently changes query results.
>>>
>>> The initial scope covers SQL syntax, brute-force exact execution
>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>> indexed ANN execution are deferred as future work.
>>>
>>> *Example SQL*:
>>>
>>> -- Batch vector search: find the 10 most similar products for each user
>>> SELECT q.user_id, t.*
>>> FROM users q
>>> INNER JOIN products t
>>>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)
>>>
>>> *Relevant Links*
>>>
>>> SPIP Document:
>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>> Discussion Thread:
>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>
>>> The vote will be open for at least 72 hours.
>>> Please vote:
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Cheers,
>>>
>>> Zhidong (Zero) Qu
>>> Software Engineer
>>> AI System
>>>
>>>
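
For readers who want a concrete picture of the brute-force exact semantics the quoted proposal describes (for each left row, keep the K right rows with the best score), here is a minimal sketch in plain Python. It is only an illustration of the intended result, not the Spark rewrite into JOIN + max_by/min_by that the SPIP proposes. The names nearest_join and cosine_similarity, and the sample users/products rows, are hypothetical and not part of the SPIP; returning 0.0 for zero-length vectors is also an assumption.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector scores 0 against everything
    return dot / (norm_a * norm_b)


def nearest_join(left, right, k, score):
    """Brute-force top-K ranking join (hypothetical helper).

    For each row in `left`, score it against every row in `right`
    and keep the k highest-scoring pairs, best first.
    """
    result = []
    for l_row in left:
        top = heapq.nlargest(k, right, key=lambda r_row: score(l_row, r_row))
        result.extend((l_row, r_row) for r_row in top)
    return result


# Hypothetical sample data standing in for the users/products tables.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [1.0, 0.1]},
    {"product_id": "b", "embedding": [0.1, 1.0]},
    {"product_id": "c", "embedding": [1.0, 1.0]},
]

pairs = nearest_join(
    users,
    products,
    2,
    lambda u, p: cosine_similarity(u["embedding"], p["embedding"]),
)
for u, p in pairs:
    print(u["user_id"], p["product_id"])
```

This scans every left/right pair, which is the full Cartesian cost the proposal's motivation mentions; the sketch only shows what an exact top-K result looks like for a similarity score, where higher is better.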
