Hi all,

I'd like to start a discussion on a new SPIP to add a NEAREST BY top-K
ranking join to Spark SQL.

- SPIP Document:
https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7


- JIRA:
https://issues.apache.org/jira/browse/SPARK-56395

Motivation

Today, finding the top-K closest matches between two tables in Spark SQL
requires verbose CROSS JOIN + window function patterns or Spark-specific
max_by/struct/inline idioms. Both are error-prone and opaque to the
optimizer - there is no way to transparently benefit from future
index-based approximate search without rewriting the query.

Proposal

This SPIP extends JOIN syntax with a NEAREST ... BY clause that expresses
top-K ranking joins directly:

SELECT q.user_id, t.*
FROM users q
JOIN products t
  APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
t.embedding)

The BY clause accepts any scalar expression, so the same syntax works for
vector similarity, geospatial distance, BM25 text relevance, or composite
scoring. APPROX / EXACT keywords make the search algorithm contract
explicit - index creation never silently changes query results.

The initial scope covers SQL parsing, brute-force execution, INNER and LEFT
OUTER join modes. Index DDL and ANN acceleration are planned as future
work; the syntax is designed to accommodate them transparently.

Please review the full SPIP for the technical details. Looking forward to
your feedback and discussion!

Best regards,
Zhidong (Zero) Qu
Software Engineer
AI System

Reply via email to