Hi Zhidong,

Thanks for putting this SPIP together. The NEAREST ... BY syntax is much
cleaner than the CROSS JOIN + window pattern, and the explicit APPROX/EXACT
contract is a good call.

I left a couple of comments on the doc around NULL and tie semantics.
Thanks for the quick replies: inheriting max_by/min_by behavior makes sense.

One thing I wanted to add: the APPROX/EXACT split leaves good room for
index-based acceleration. On the Iceberg side, we've been discussing vector
index support as part of the secondary index work. When that is ready,
queries using APPROX can use the index automatically, without any rewrite.
Looking forward to seeing these efforts converge.

Thanks,

Huaxin

On Tue, Apr 14, 2026 at 3:59 PM Zero Qu via dev <[email protected]>
wrote:

> Hi all,
>
> I'd like to start a discussion on a new SPIP to add a NEAREST BY top-K
> ranking join to Spark SQL.
>
> - SPIP Document:
>
> https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7
>
>
> - JIRA:
> https://issues.apache.org/jira/browse/SPARK-56395
>
> Motivation
>
> Today, finding the top-K closest matches between two tables in Spark SQL
> requires verbose CROSS JOIN + window function patterns or Spark-specific
> max_by/struct/inline idioms. Both are error-prone and opaque to the
> optimizer - there is no way to transparently benefit from future
> index-based approximate search without rewriting the query.
>
> Proposal
>
> This SPIP extends JOIN syntax with a NEAREST ... BY clause that expresses
> top-K ranking joins directly:
>
> SELECT q.user_id, t.*
> FROM users q
> JOIN products t
>   APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding,
> t.embedding)
>
> The BY clause accepts any scalar expression, so the same syntax works for
> vector similarity, geospatial distance, BM25 text relevance, or composite
> scoring. APPROX / EXACT keywords make the search algorithm contract
> explicit - index creation never silently changes query results.
>
> The initial scope covers SQL parsing, brute-force execution, INNER and
> LEFT OUTER join modes. Index DDL and ANN acceleration are planned as future
> work; the syntax is designed to accommodate them transparently.
>
> Please review the full SPIP for the technical details. Looking forward to
> your feedback and discussion!
>
> Best regards,
> Zhidong (Zero) Qu
> Software Engineer
> AI System
>
>

Reply via email to