Hi all, I'd like to start a discussion on a new SPIP to add a NEAREST BY top-K ranking join to Spark SQL.
- SPIP Document: https://docs.google.com/document/d/1opFVcQJgEWDWUVB7uVlFMlNomRwxqRu8iW0JmvCvxF0/edit?tab=t.0#heading=h.hf633coi8nc7 - JIRA: https://issues.apache.org/jira/browse/SPARK-56395 Motivation Today, finding the top-K closest matches between two tables in Spark SQL requires verbose CROSS JOIN + window function patterns or Spark-specific max_by/struct/inline idioms. Both are error-prone and opaque to the optimizer - there is no way to transparently benefit from future index-based approximate search without rewriting the query. Proposal This SPIP extends JOIN syntax with a NEAREST ... BY clause that expresses top-K ranking joins directly: SELECT q.user_id, t.* FROM users q JOIN products t APPROX NEAREST 10 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding) The BY clause accepts any scalar expression, so the same syntax works for vector similarity, geospatial distance, BM25 text relevance, or composite scoring. APPROX / EXACT keywords make the search algorithm contract explicit - index creation never silently changes query results. The initial scope covers SQL parsing, brute-force execution, INNER and LEFT OUTER join modes. Index DDL and ANN acceleration are planned as future work; the syntax is designed to accommodate them transparently. Please review the full SPIP for the technical details. Looking forward to your feedback and discussion! Best regards, Zhidong (Zero) Qu Software Engineer AI System
