Re: [VOTE] SPIP: NEAREST BY Top-K Ranking Join

Anish Shrigondekar via dev Fri, 01 May 2026 11:33:27 -0700

+1 (non-binding)

It would also be interesting to see how we could add streaming support for
this operator in the future.


Thanks,
Anish

On Fri, May 1, 2026 at 10:42 AM Menelaos Karavelas <
[email protected]> wrote:

> +1 (non-binding)
>
>
> On May 1, 2026, at 10:31 AM, Gengliang Wang <[email protected]> wrote:
>
> +1
>
> On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
>> [email protected]> wrote:
>>
>>> +1
>>>
>>>
>>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>>
>>>>>> *Motivation*
>>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>>>> or max_by/min_by workarounds - patterns that materialize the full
>>>>>> Cartesian product and give the optimizer no semantic signal for
>>>>>> specialized execution strategies.
>>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>>> with pgvector) all provide dedicated primitives for this. Spark currently
>>>>>> does not.
>>>>>>
>>>>>> *Proposal*
>>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>>>> expression - making the same syntax usable across vector search,
>>>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords
>>>>>> make the search algorithm contract explicit, ensuring future index
>>>>>> creation or deletion never silently changes query results.
>>>>>>
>>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>>>> indexed ANN execution are deferred as future work.
>>>>>>
>>>>>> *Example SQL*:
>>>>>>
>>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>>> SELECT q.user_id, t.*
>>>>>> FROM users q
>>>>>> INNER JOIN products t
>>>>>>   APPROX NEAREST 10 BY SIMILARITY
>>>>>>     vector_cosine_similarity(q.embedding, t.embedding)
>>>>>>
>>>>>> *Relevant Links*
>>>>>>
>>>>>> SPIP Document:
>>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>>> Discussion Thread:
>>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>>
>>>>>> The vote will be open for at least 72 hours.
>>>>>> Please vote:
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>> Cheers,
>>>>>>
>>>>>> Zhidong (Zero) Qu
>>>>>> Software Engineer
>>>>>> AI System
>>>>>>
>>>>>>
>
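
The brute-force exact semantics described in the quoted proposal (score every
query/target pair, keep the K best targets per query row) can be sketched in
plain Python. This is only an illustration under stated assumptions, not the
SPIP's implementation: the function names, the dictionary-based "tables", and
the sample vectors are all hypothetical, and Spark itself is not involved.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_by_similarity(queries, targets, k):
    """Hypothetical helper: for each query id, return the ids of the k targets
    with the highest similarity, best first. Every pair is scored, which is
    the brute-force (exact) strategy rather than an indexed ANN search."""
    result = {}
    for qid, qvec in queries.items():
        scored = ((cosine_similarity(qvec, tvec), tid) for tid, tvec in targets.items())
        # nlargest keeps only k candidates at a time instead of sorting all pairs
        result[qid] = [tid for _, tid in heapq.nlargest(k, scored)]
    return result


# Made-up sample data standing in for the users and products tables
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [0.9, 0.1], "p2": [0.1, 0.9], "p3": [0.7, 0.7]}

top2 = nearest_by_similarity(users, products, 2)
print(top2)
```

Each query row keeps at most K matches, so the output holds at most K rows per
user even though every user/product pair is scored along the way.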
