Re: [VOTE] SPIP: NEAREST BY Top-K Ranking Join

Peter Toth Wed, 29 Apr 2026 08:22:55 -0700

+1 (non-binding)

On Wed, Apr 29, 2026 at 4:33 PM Antônio Marcos Souza Pereira <
[email protected]> wrote:


> +1
>
>
> On Tue, Apr 28, 2026 at 9:03 PM huaxin gao <[email protected]> wrote:
>
>> +1
>>
>> On Tue, Apr 28, 2026 at 4:49 PM Wenchen Fan <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 28, 2026 at 1:04 PM Zero Qu via dev <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.
>>>>
>>>> *Motivation*
>>>> Top-K nearest neighbor search is a fundamental building block for
>>>> semantic search, retrieval-augmented generation (RAG), recommendation
>>>> systems, and geospatial nearest-neighbor queries. Today, Spark SQL users
>>>> have to express this pattern through verbose CROSS JOIN + window function
>>>> or max_by/min_by workarounds - patterns that materialize the full Cartesian
>>>> product and give the optimizer no semantic signal for specialized execution
>>>> strategies.
>>>> Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL
>>>> with pgvector) all provide dedicated primitives for this. Spark currently
>>>> does not.
>>>>
>>>> *Proposal*
>>>> This SPIP proposes extending standard SQL JOIN syntax with a NEAREST
>>>> ... BY clause for top-K ranking joins. The BY expression is pluggable -
>>>> vector similarity, geometric distance, BM25, or any composite scoring
>>>> expression - making the same syntax usable across vector search,
>>>> geospatial, and text retrieval use cases. The APPROX / EXACT keywords make
>>>> the search algorithm contract explicit, ensuring future index creation or
>>>> deletion never silently changes query results.
>>>>
>>>> The initial scope covers SQL syntax, brute-force exact execution
>>>> (rewritten into existing physical operators: JOIN + max_by/min_by with K
>>>> overload), and Spark Connect / PySpark API support. Vector index DDL and
>>>> indexed ANN execution are deferred as future work.
>>>>
>>>> *Example SQL*:
>>>>
>>>> -- Batch vector search: find the 10 most similar products for each user
>>>> SELECT q.user_id, t.*
>>>> FROM users q
>>>> INNER JOIN products t
>>>>   APPROX NEAREST 10 BY SIMILARITY
>>>>     vector_cosine_similarity(q.embedding, t.embedding)
>>>>
>>>> *Relevant Links*
>>>>
>>>> SPIP Document:
>>>> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>>>> Discussion Thread:
>>>> https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-56395
>>>>
>>>> The vote will be open for at least 72 hours.
>>>> Please vote:
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Cheers,
>>>>
>>>> Zhidong (Zero) Qu
>>>> Software Engineer
>>>> AI System
>>>>
>>>>
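
The brute-force exact semantics described in the quoted proposal (for each
left-side row, keep the K right-side rows with the best score) can be
sketched in plain Python. This is only an illustration of the result
contract, not Spark's planned implementation: the names cosine_similarity
and nearest_by_similarity, the dict-based "tables", and the handling of
zero vectors are all assumptions made for the example.

```python
import heapq
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_by_similarity(
    left: Dict[str, Vector], right: Dict[str, Vector], k: int
) -> Dict[str, List[str]]:
    """For each left row, return the ids of the k highest-scoring right rows.

    Every (left, right) pair is scored, so the cost is the same as the
    Cartesian product; only the top k per left row are retained.
    """
    result: Dict[str, List[str]] = {}
    for left_id, left_vec in left.items():
        scored = (
            (cosine_similarity(left_vec, right_vec), right_id)
            for right_id, right_vec in right.items()
        )
        # nlargest returns the k best pairs, sorted by descending score.
        top = heapq.nlargest(k, scored)
        result[left_id] = [right_id for _, right_id in top]
    return result


# Hypothetical toy data standing in for the users and products tables.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
products = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}

matches = nearest_by_similarity(users, products, k=2)
print(matches)
```

An exact search like this always returns the same rows for the same data,
which is the guarantee the EXACT keyword is meant to make explicit; an
APPROX search may trade that guarantee for speed.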
