Durgaprasad M L created SPARK-57319:
---------------------------------------

             Summary: Rename misleading approx_top_k terminology to 
approx_frequent_items
                 Key: SPARK-57319
                 URL: https://issues.apache.org/jira/browse/SPARK-57319
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Durgaprasad M L


The current approx_top_k naming in Spark is misleading because the underlying 
implementation is based on Apache DataSketches Frequent Items sketches, which 
do not provide strict top-k guarantees.

Instead, the sketch identifies frequent items (heavy hitters) using 
threshold-oriented probabilistic guarantees. It may legitimately return fewer 
than k items, or no items at all, depending on the stream distribution and the 
sketch configuration.
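
To illustrate why a frequent-items summary can return fewer than k items, here 
is a minimal sketch of a Misra-Gries-style counter summary in plain Python. It 
is a simplified stand-in and not the DataSketches or Spark implementation; the 
names misra_gries, frequent_items and num_counters are hypothetical and chosen 
only for this example. An item is reported only if its counter, which is a 
lower bound on its true frequency, exceeds the threshold n / (num_counters + 1).

```python
def misra_gries(stream, num_counters):
    """Summarize a stream with at most num_counters counters (Misra-Gries)."""
    counters = {}
    n = 0
    for item in stream:
        n += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # Summary is full: decrement every counter and drop those that
            # reach zero. The incoming item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, n


def frequent_items(stream, num_counters, k):
    """Return at most k (item, lower_bound) pairs that clear the threshold.

    Each counter is a lower bound on the item's true frequency, so anything
    above n / (num_counters + 1) is guaranteed to be that frequent.
    """
    counters, n = misra_gries(stream, num_counters)
    threshold = n / (num_counters + 1)
    heavy = [(item, count) for item, count in counters.items() if count > threshold]
    heavy.sort(key=lambda pair: pair[1], reverse=True)
    return heavy[:k]


# One dominant item followed by 40 distinct items: only "a" clears the
# threshold, so asking for k=3 still yields a single result.
skewed = ["a"] * 60 + [str(i) for i in range(40)]
print(frequent_items(skewed, num_counters=4, k=3))

# 100 distinct items: nothing is frequent, so the result is empty.
uniform = list(range(100))
print(frequent_items(uniform, num_counters=4, k=3))
```

In both calls k is 3, yet the result holds one item and zero items 
respectively, which is the behavior a "top k" name does not suggest.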

This improvement proposes:
- renaming approx_top_k terminology to approx_frequent_items
- aligning terminology with Apache DataSketches documentation
- preserving backward compatibility through deprecated aliases
- updating Scala, PySpark, Spark Connect APIs, docs, and test suites
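
As a rough illustration of the deprecated-alias idea, the following is a 
generic Python pattern and not the mechanism used in the related PR or in 
Spark itself. The helper deprecated_alias is hypothetical, and the 
approx_frequent_items function here is only an exact-counting placeholder that 
stands in for the real aggregate.

```python
import functools
import warnings
from collections import Counter


def deprecated_alias(new_func, old_name):
    """Return a wrapper that forwards to new_func and warns about old_name."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def approx_frequent_items(values, k):
    # Placeholder only: exact counting, not a sketch-based aggregate.
    return Counter(values).most_common(k)


# The old name keeps working but emits a DeprecationWarning on each call.
approx_top_k = deprecated_alias(approx_frequent_items, "approx_top_k")
```

Existing callers of the old name would continue to get the same results while 
being pointed at the new name.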

Related PR:
https://github.com/apache/spark/pull/56333



