ishnagy opened a new pull request, #50933: URL: https://github.com/apache/spark/pull/50933
### What changes were proposed in this pull request?

This change fixes a performance degradation issue in the current BloomFilter implementation. The current bit index calculation logic does not use any part of the indexable space above the first 31 bits, so when the inserted item count approaches (or exceeds) `Integer.MAX_VALUE`, it produces significantly worse collision rates than an (ideal) uniformly distributing hash function.

### Why are the changes needed?

This should qualify as a bug. The upper bound on the bit capacity of the current BloomFilter implementation in Spark is approximately 137G bits (64-bit longs in an `Integer.MAX_VALUE`-sized array), yet the current indexing scheme can only address about 2G of these bits. On the other hand, due to the way BloomFilters are used, the bug won't cause any logical errors; it will just gradually render the BloomFilter instance useless by forcing more and more queries onto the slow path.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### new test

One new Java test class was added to `sketch` to test different combinations of item counts and expected fpp rates:

```
common/sketch/src/test/java/org/apache/spark/util/sketch/TestSparkBloomFilter.java
```

`testAccuracyEvenOdd`, over N iterations, inserts N even numbers (2\*i) into the BloomFilter and leaves out the N odd numbers (2\*i+1). It then measures the false positive rate on the not-inserted odd numbers.

`testAccuracyRandom`, over 2N iterations, inserts N pseudorandomly generated numbers into two differently seeded (theoretically independent) BloomFilter instances. It then measures the false positive rate of the primary instance over the true-negative set of the secondary instance.
#### patched

One minor (test) issue was fixed in

```
common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala
```

where potential repetitions in the randomly generated stream of insertable items resulted in slightly worse measured fpp than the actual value. The problem affected mostly those test cases where the cardinality of the tested type is low (so the chance of repetition is high).

#### removed from the default runs

Running these tests as part of the default build process was turned off by adding the `@Disabled` annotation to the new test class.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
