LuciferYang opened a new pull request, #54446: URL: https://github.com/apache/spark/pull/54446
### What changes were proposed in this pull request?

This PR fixes `CountVectorizer` to use a deterministic ordering when selecting the top vocabulary terms. Specifically, when two terms have the same frequency (count), they are now ordered lexicographically by the term itself as a tie-breaker.

### Why are the changes needed?

Currently, `CountVectorizer` uses `wordCounts.top(...)(Ordering.by(_._2))` to select the vocabulary. This comparison considers only term counts. When multiple terms have the same count, their order in the resulting vocabulary is non-deterministic: it depends on the RDD partition processing order or on the iteration order of the internal hash maps.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Added a new test case in `CountVectorizerSuite` that intentionally creates a dataset with tied term counts and asserts a specific, deterministic vocabulary order.

### Was this patch authored or co-authored using generative AI tooling?

No
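The tie-breaking idea described above can be sketched on plain Scala collections, without Spark. This is only an illustration, not the code in the patch: the object name `TieBreakSketch`, the helper `topTerms`, and the choice of ascending term order for ties are all assumptions made for the example.

```scala
// Hypothetical, Spark-free sketch of deterministic top-k vocabulary selection.
object TieBreakSketch {

  // Sort by count descending, then by term ascending (assumed tie-break direction),
  // so the result does not depend on the order in which pairs arrive.
  def topTerms(wordCounts: Seq[(String, Long)], vocabSize: Int): Seq[String] =
    wordCounts
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val counts = Seq("b" -> 2L, "c" -> 2L, "a" -> 2L, "d" -> 5L)

    // Simulate a different partition/iteration order by reversing the input.
    val reordered = counts.reverse

    println(topTerms(counts, 3))    // List(d, a, b)
    println(topTerms(reordered, 3)) // List(d, a, b)
  }
}
```

Comparing on the pair `(count, term)` instead of the count alone gives a total order over distinct terms, which is why both input orders produce the same vocabulary here.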
