[PR] [SPARK-55655][MLLIB] Make `CountVectorizer` vocabulary deterministic when counts are equal [spark]

via GitHub Tue, 24 Feb 2026 03:32:31 -0800


LuciferYang opened a new pull request, #54446:
URL: https://github.com/apache/spark/pull/54446


   ### What changes were proposed in this pull request?
   This PR fixes `CountVectorizer` to use a deterministic ordering when selecting 
the top vocabulary terms. Specifically, when two terms have the same frequency 
(count), they are now ordered lexicographically by the term itself as a 
tie-breaker.
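
To illustrate the tie-breaking rule, here is a minimal sketch on plain Scala collections rather than an RDD. It is not the code from this PR: `TieBreakSketch` and `topTerms` are hypothetical names, and the direction of the tie-break (the lexicographically smaller term ranks first) is an assumption.

```scala
object TieBreakSketch {
  // Hypothetical helper: pick the `vocabSize` most frequent terms.
  // Sort key is (-count, term): higher counts come first, and terms with
  // equal counts are ordered lexicographically (assumed direction).
  def topTerms(wordCounts: Seq[(String, Long)], vocabSize: Int): Seq[String] =
    wordCounts
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    // "a" and "b" are tied at count 2; the input lists "b" first on purpose.
    val counts = Seq(("b", 2L), ("a", 2L), ("c", 3L), ("d", 1L))
    println(topTerms(counts, 3))         // List(c, a, b)
    println(topTerms(counts.reverse, 3)) // List(c, a, b), independent of input order
  }
}
```

Because the sort key includes the term, no two distinct entries compare as equal, so the result no longer depends on the order in which the pairs arrive.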
   
   
   ### Why are the changes needed?
   Currently, `CountVectorizer` uses `wordCounts.top(...)(Ordering.by(_._2))` 
to select the vocabulary. This comparison only considers term counts. When 
multiple terms have the same count, the resulting order in the vocabulary is 
non-deterministic and depends on the RDD partition processing order or the 
iteration order of the internal hash maps.
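
The effect can be reproduced without Spark. The sketch below uses a count-only `Ordering` on a local `Seq`; `CountOnlyOrderingSketch` and `rank` are hypothetical names, and a stable local sort stands in for `RDD.top`, so it only mimics how arrival order leaks into the result.

```scala
object CountOnlyOrderingSketch {
  // Same shape as Ordering.by(_._2): only the count takes part in the comparison.
  val byCount: Ordering[(String, Long)] =
    Ordering.by[(String, Long), Long](_._2)

  // Hypothetical helper: rank terms by descending count using the count-only ordering.
  // Seq.sorted is stable, so tied terms keep whatever order they arrived in.
  def rank(wordCounts: Seq[(String, Long)]): Seq[String] =
    wordCounts.sorted(byCount.reverse).map(_._1)

  def main(args: Array[String]): Unit = {
    // The ordering cannot tell tied terms apart.
    println(byCount.compare(("a", 2L), ("b", 2L))) // 0

    val counts = Seq(("a", 2L), ("b", 2L), ("c", 3L))
    println(rank(counts))         // List(c, a, b)
    println(rank(counts.reverse)) // List(c, b, a): same data, different vocabulary order
  }
}
```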
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   - Pass GitHub Actions
   - Added a new test case in `CountVectorizerSuite` that intentionally creates 
a dataset with tied term counts and asserts a specific, deterministic 
vocabulary order.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

