Re: [VOTE] SPIP: NEAREST BY Top-K Ranking Join

Yingyi Bu Fri, 01 May 2026 12:29:14 -0700

+1 (non-binding)

Best,
Yingyi


On Fri, May 1, 2026 at 11:33 AM Anish Shrigondekar via dev <
[email protected]> wrote:

> +1 (non-binding)
>
> Would also be interesting to see how we could add streaming support for
> this operator in the future as well
>
> Thanks,
> Anish
>
> On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <
> [email protected]> wrote:
>
>> +1 (non-binding)
>>
>>
>> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>>
>> +1
>>
>> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
>>> [email protected]> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>>>
>>>>>>> *Motivation*
>>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>>>>> or max_by/min_by workarounds - patterns that materialize the full
>>>>>>> Cartesian product and give the optimizer no semantic signal for
>>>>>>> specialized execution strategies.
>>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>>> with pgvector) all provide dedicated primitives for this. Spark
>>>>>>> currently does not.
>>>>>>>
>>>>>>> *Proposal*
>>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>>>>> expression - making the same syntax usable across vector search,
>>>>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords
>>>>>>> make the search algorithm contract explicit, ensuring future index
>>>>>>> creation or deletion never silently changes query results.
>>>>>>>
>>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>>>>> indexed ANN execution are deferred as future work.
>>>>>>>
>>>>>>> *Example SQL*:
>>>>>>>
>>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>>> SELECT q.user_id, t.*
>>>>>>> FROM users q
>>>>>>> INNER JOIN products t
>>>>>>>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>>
>>>>>>> *Relevant Links*
>>>>>>>
>>>>>>> SPIP Document:
>>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>>> Discussion Thread:
>>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>>
>>>>>>> The vote will be open for at least 72 hours.
>>>>>>> Please vote:
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Zhidong (Zero) Qu
>>>>>>> Software Engineer
>>>>>>> AI System
>>>>>>>
>>>>>>>
>>
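
For readers following the thread, the brute-force exact semantics described in the proposal (score every query/target pair, keep the K best per query row) can be illustrated outside Spark. The sketch below is only an illustration in plain Python, not the SPIP's implementation or any Spark API; the function names `cosine_similarity` and `nearest_by_similarity` and the sample data are hypothetical, and it assumes non-zero embedding vectors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_by_similarity(queries, targets, k, score=cosine_similarity):
    """Brute-force top-k: score every (query, target) pair, keep the k best per query.

    queries and targets map an id to an embedding (hypothetical layout).
    Returns {query_id: [target_id, ...]}, most similar first.
    """
    result = {}
    for q_id, q_vec in queries.items():
        # nlargest returns the k highest-scoring targets without sorting all of them.
        best = heapq.nlargest(
            k, targets.items(), key=lambda item: score(q_vec, item[1])
        )
        result[q_id] = [t_id for t_id, _ in best]
    return result


# Hypothetical sample data: two users, three products, 2-dimensional embeddings.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}

top2 = nearest_by_similarity(users, products, k=2)
print(top2)
```

The `score` parameter mirrors the pluggable BY expression in the proposal: any function of a query row and a target row can be passed in. The nested scan costs one score evaluation per pair, the same full Cartesian product the Motivation section describes.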
