Hi all,

I'd like to call a vote on the *SPIP: NEAREST BY Top-K Ranking Join*.

*Motivation*
Top-K nearest-neighbor search is a fundamental building block for semantic
search, retrieval-augmented generation (RAG), recommendation systems, and
geospatial nearest-neighbor queries. Today, Spark SQL users have to express
this pattern through verbose CROSS JOIN + window function or max_by/min_by
workarounds - patterns that materialize the full Cartesian product and give
the optimizer no semantic signal for specialized execution strategies.
Competing systems (BigQuery, SQL Server 2025, Snowflake, PostgreSQL with
pgvector) all provide dedicated primitives for this. Spark currently does
not.

*Proposal*
This SPIP proposes extending standard SQL JOIN syntax with a NEAREST ... BY
clause for top-K ranking joins. The BY expression is pluggable - vector
similarity, geometric distance, BM25, or any composite scoring expression -
making the same syntax usable across vector search, geospatial, and text
retrieval use cases. The APPROX / EXACT keywords make the search algorithm
contract explicit, ensuring future index creation or deletion never
silently changes query results.

The initial scope covers SQL syntax, brute-force exact execution (rewritten
into existing physical operators: JOIN + max_by/min_by with K overload),
and Spark Connect / PySpark API support. Vector index DDL and indexed ANN
execution are deferred as future work.
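
To make the intended semantics of the brute-force exact path concrete, here
is a minimal pure-Python sketch. It is not the proposed Spark implementation
and does not use any Spark API; the function names, the row layout
(key, vector), and the sample data are all hypothetical. For each left row
it scores every right row and keeps the K highest-scoring ones.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_by_similarity(left, right, k, score):
    """Hypothetical helper: exact top-k ranking join by brute force.

    left and right are lists of (key, vector) pairs. Every left row is
    compared against every right row, so the cost is O(len(left) * len(right)).
    """
    result = []
    for l_key, l_vec in left:
        # Keep the k right rows with the highest score for this left row.
        top = heapq.nlargest(k, right, key=lambda r: score(l_vec, r[1]))
        result.extend((l_key, r_key) for r_key, _ in top)
    return result


# Illustrative data only.
users = [("u1", [1.0, 0.0]), ("u2", [0.0, 1.0])]
products = [("p1", [1.0, 0.1]), ("p2", [0.1, 1.0]), ("p3", [1.0, 1.0])]

matches = nearest_by_similarity(users, products, k=2, score=cosine_similarity)
print(matches)
```

Because every pair is scored, the result is exact and does not depend on
whether an index exists. An approximate strategy would be free to skip
candidates, trading recall for speed.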

*Example SQL*:

-- Batch vector search: find the 10 most similar products for each user
SELECT q.user_id, t.*
FROM users q
INNER JOIN products t
  APPROX NEAREST 10 BY SIMILARITY
    vector_cosine_similarity(q.embedding, t.embedding)

*Relevant Links*

SPIP Document:
https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
Discussion Thread:
https://lists.apache.org/thread/zg8nk236g9f4lg6d2tm6s3xh0cfhg4hm
JIRA: https://issues.apache.org/jira/browse/SPARK-56395

The vote will be open for at least 72 hours.
Please vote:
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Cheers,

Zhidong (Zero) Qu
Software Engineer
AI System
