Re: [VOTE] SPIP: NEAREST BY Top-K Ranking Join

Gengliang Wang Fri, 01 May 2026 10:33:29 -0700

+1

On Wed, Apr 29, 2026 at 8:20 AM Peter Toth <[email protected]> wrote:


> +1 (non-binding)
>
> On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
> [email protected]> wrote:
>
>> +1
>>
>>
>> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>>
>>>>> *Motivation*
>>>>> Top-K nearest neighbor search is a fundamental building block for
>>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>>>>> product and give the optimizer no semantic signal for specialized execution
>>>>> strategies.
>>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>>> with pgvector) all provide dedicated primitives for this. Spark currently
>>>>> does not.
>>>>>
>>>>> *Proposal*
>>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>>> expression - making the same syntax usable across vector search,
>>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords make
>>>>> the search algorithm contract explicit, ensuring future index creation or
>>>>> deletion never silently changes query results.
>>>>>
>>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>>> indexed ANN execution are deferred as future work.
>>>>>
>>>>> *Example SQL*:
>>>>>
>>>>> -- Batch vector search: find the 10 most similar products for each user
>>>>> SELECT q.user_id, t.*
>>>>> FROM users q
>>>>> INNER JOIN products t
>>>>>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding)
>>>>>
>>>>> *Relevant Links*
>>>>>
>>>>> SPIP Document:
>>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>>> Discussion Thread:
>>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>>
>>>>> The vote will be open for at least 72 hours.
>>>>> Please vote:
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Zhidong (Zero) Qu
>>>>> Software Engineer
>>>>> AI System
>>>>>
>>>>>
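
For readers following the thread, the brute-force exact strategy mentioned in the proposal (score every left/right pair, then keep the K best right-side rows per left-side row) can be illustrated with a minimal pure-Python sketch. This is not Spark's implementation or the SPIP's actual rewrite. The function names `cosine_similarity` and `nearest_by_exact`, the dict-based tables, and the sample vectors are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_by_exact(left, right, k, score):
    """Hypothetical brute-force top-K ranking join.

    For each left row, score it against every right row and keep the keys
    of the k highest-scoring right rows, best first. Ties on score fall
    back to comparing the right-side keys, which is an assumption of this
    sketch and not something the proposal specifies.
    """
    result = {}
    for left_key, left_vec in left.items():
        # Lazily score every (left, right) pair: the full Cartesian product.
        scored = (
            (score(left_vec, right_vec), right_key)
            for right_key, right_vec in right.items()
        )
        # nlargest keeps only k candidates in memory at a time.
        result[left_key] = [key for _, key in heapq.nlargest(k, scored)]
    return result


# Hypothetical sample data standing in for the users and products tables.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}

top2 = nearest_by_exact(users, products, k=2, score=cosine_similarity)
print(top2)
```

Because the scoring function is passed in as an argument, the same sketch works for a distance metric by negating the score, which mirrors the pluggable BY expression described in the proposal.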
