Re: [VOTE] SPIP: NEAREST BY Top-K Ranking Join

Menelaos Karavelas Fri, 01 May 2026 10:40:36 -0700

+1 (non-binding)


> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
> 
> +1
> 
> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>> +1 (non-binding)
>> 
>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira 
>> <[email protected]> wrote:
>>> +1
>>> 
>>> 
>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:
>>>> +1
>>>> 
>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>>>> +1
>>>>> 
>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]> wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I'd like to call a vote on the SPIP: NEAREST BY Top-K Ranking Join.
>>>>>> 
>>>>>> Motivation
>>>>>> Top-K nearest neighbor search is a fundamental building block for 
>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation 
>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users 
>>>>>> have to express this pattern through verbose CROSS JOIN + window 
>>>>>> function or max_by/min_by workarounds - patterns that materialize the 
>>>>>> full Cartesian product and give the optimizer no semantic signal for 
>>>>>> specialized execution strategies.
>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with 
>>>>>> pgvector) all provide dedicated primitives for this. Spark currently 
>>>>>> does not.
>>>>>> 
>>>>>> Proposal
>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ... 
>>>>>> BY clause for top-K ranking joins. The BY expression is pluggable - 
>>>>>> vector similarity, geometric distance, BM25, or any composite scoring 
>>>>>> expression - making the same syntax usable across vector search, 
>>>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords 
>>>>>> make the search algorithm contract explicit, ensuring future index 
>>>>>> creation or deletion never silently changes query results.
>>>>>> 
>>>>>> The initial scope covers SQL syntax, brute-force exact execution 
>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K 
>>>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and 
>>>>>> indexed ANN execution are deferred as future work.
>>>>>> 
>>>>>> Example SQL:
>>>>>> 
>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>> SELECT q.user_id, t.*
>>>>>> FROM users q
>>>>>> INNER JOIN products t
>>>>>>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)
>>>>>> 
>>>>>> Relevant Links
>>>>>> 
>>>>>> SPIP Document: 
>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>> Discussion Thread: 
>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>> 
>>>>>> The vote will be open for at least 72 hours.
>>>>>> Please vote:
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Zhidong (Zero) Qu
>>>>>> Software Engineer
>>>>>> AI System
>>>>>>
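
For readers following the thread, the brute-force exact semantics described in the proposal (for each left row, keep the K right rows with the highest score) can be sketched in plain Python. This is only an illustration of the semantics, not Spark's implementation and not code from the SPIP: the helper names `cosine_similarity` and `nearest_by_similarity`, the row layout, and the sample data are all hypothetical, and tie-breaking and NULL handling are not modeled.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def nearest_by_similarity(left_rows, right_rows, k, score):
    """Brute-force top-K ranking join with inner-join behavior.

    For every left row, score it against every right row and keep the k
    highest-scoring right rows. Left rows with no candidates emit nothing.
    """
    result = []
    for left in left_rows:
        top_k = heapq.nlargest(
            k, right_rows, key=lambda right: score(left, right)
        )
        result.extend((left, right) for right in top_k)
    return result


# Hypothetical sample data standing in for the users/products tables.
users = [
    {"user_id": 1, "embedding": [1.0, 0.0]},
    {"user_id": 2, "embedding": [0.0, 1.0]},
]
products = [
    {"product_id": "a", "embedding": [0.9, 0.1]},
    {"product_id": "b", "embedding": [0.1, 0.9]},
    {"product_id": "c", "embedding": [0.7, 0.7]},
]


def embedding_score(q, t):
    return cosine_similarity(q["embedding"], t["embedding"])


pairs = nearest_by_similarity(users, products, k=2, score=embedding_score)
matches = [(q["user_id"], t["product_id"]) for q, t in pairs]
print(matches)  # [(1, 'a'), (1, 'c'), (2, 'b'), (2, 'c')]
```

The sketch scores every left/right pair, so its cost grows with the product of the two input sizes. That is the same cost profile as the brute-force exact path in the initial scope, and it is the motivation for the indexed ANN execution that the proposal defers to future work.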
