ishnagy opened a new pull request, #50933:
URL: https://github.com/apache/spark/pull/50933

   ### What changes were proposed in this pull request?
   
   This change fixes a performance degradation issue in the current BloomFilter implementation.
   
   The current bit index calculation logic does not use any part of the indexable space above the first 31 bits, so when the inserted item count approaches (or exceeds) `Integer.MAX_VALUE`, it produces significantly worse collision rates than an (ideal) uniformly distributing hash function would.
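   
   For illustration, a minimal sketch of the kind of indexing pattern this refers to (simplified, with illustrative names, not the exact Spark code):
   ```
   // Each combined hash is a 32-bit int, so even when bitSize is far larger
   // than Integer.MAX_VALUE, (combinedHash % bitSize) can never address any
   // bit above index 2^31 - 1.
   static void setBits(long[] words, long bitSize, int h1, int h2, int numHashFunctions) {
     for (int i = 1; i <= numHashFunctions; i++) {
       int combinedHash = h1 + (i * h2);     // 32-bit arithmetic only
       if (combinedHash < 0) {
         combinedHash = ~combinedHash;       // flip to a non-negative value
       }
       long bitIndex = combinedHash % bitSize; // always < 2^31, upper range unused
       words[(int) (bitIndex >>> 6)] |= 1L << bitIndex;
     }
   }
   ```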
   
   
   ### Why are the changes needed?
   
   This should qualify as a bug.
   
   The upper bound on the bit capacity of the current BloomFilter implementation in Spark is approximately 137G bits (64-bit longs in an `Integer.MAX_VALUE`-sized array). The current indexing scheme can only address about 2G of these bits.
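   
   A quick back-of-the-envelope check of those figures (plain Java, just arithmetic):
   ```
   long maxWords        = Integer.MAX_VALUE;   // array length upper bound
   long maxBits         = maxWords * 64L;      // 137_438_953_408 (~137G bits)
   long addressableBits = Integer.MAX_VALUE;   //   2_147_483_647 (~2G bits)
   System.out.println(maxBits + " vs " + addressableBits);
   ```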
   
   On the other hand, due to the way BloomFilters are used, the bug won't cause any logical errors; it will just gradually render the BloomFilter instance useless by forcing more and more queries onto the slow path.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   #### new test
   
   One new Java test class was added to `sketch` to test different combinations of item counts and expected fpp rates.
   ```
   common/sketch/src/test/java/org/apache/spark/util/sketch/TestSparkBloomFilter.java
   ```
   
   `testAccuracyEvenOdd`
   in N iterations inserts N even numbers (`2*i`) into the BloomFilter and leaves out the N odd numbers (`2*i+1`). It then measures the false positive rate on the not-inserted odd numbers.
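   
   A condensed sketch of the idea (illustrative, not the exact test code; `N` and `FPP` are placeholders):
   ```
   BloomFilter filter = BloomFilter.create(N, FPP);
   for (long i = 0; i < N; i++) {
     filter.putLong(2 * i);                     // insert even numbers only
   }
   long falsePositives = 0;
   for (long i = 0; i < N; i++) {
     if (filter.mightContainLong(2 * i + 1)) {  // odd numbers were never inserted
       falsePositives++;
     }
   }
   double measuredFpp = (double) falsePositives / N;
   ```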
   
   `testAccuracyRandom`
   in 2N iterations inserts the same N pseudorandomly generated numbers into two differently seeded (theoretically independent) BloomFilter instances. It then measures the false positive rate of the primary instance over the true-negative set of the secondary instance.
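   
   Roughly the following flow (a sketch under assumptions: how the two instances end up differently seeded is elided here, and `N`, `FPP`, and the seeds are placeholders):
   ```
   BloomFilter primary   = BloomFilter.create(N, FPP);  // differently seeded
   BloomFilter secondary = BloomFilter.create(N, FPP);  // in the real test
   Random insertRnd = new Random(42);
   for (long i = 0; i < N; i++) {
     long item = insertRnd.nextLong();
     primary.putLong(item);     // the same stream goes
     secondary.putLong(item);   // into both filters
   }
   long probes = 0, falsePositives = 0;
   Random probeRnd = new Random(43);
   for (long i = 0; i < N; i++) {
     long candidate = probeRnd.nextLong();
     if (!secondary.mightContainLong(candidate)) {  // certified true negative
       probes++;
       if (primary.mightContainLong(candidate)) falsePositives++;
     }
   }
   double measuredFpp = (double) falsePositives / probes;
   ```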
   
   #### patched
   
   One minor (test) issue was fixed in
   ```
   common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala
   ```
   where potential repetitions in the randomly generated stream of insertable items resulted in slightly worse fpp measurements than the actual value. The problem affected those test cases more where the cardinality of the tested type is low (so the chance of repetition is high).
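   
   The essence of the fix, sketched in Java for consistency with the snippets above (the actual change is in the Scala suite; names are illustrative):
   ```
   // Track what was actually inserted, so that probe values which happen to
   // collide with inserted ones (likely for low-cardinality types such as
   // Byte or Short) are not miscounted as false positives.
   Set<Long> inserted = new HashSet<>();
   for (long item : insertedItems) {
     filter.putLong(item);
     inserted.add(item);
   }
   long probes = 0, falsePositives = 0;
   for (long candidate : probeItems) {
     if (inserted.contains(candidate)) continue;  // a true positive, skip it
     probes++;
     if (filter.mightContainLong(candidate)) falsePositives++;
   }
   double measuredFpp = (double) falsePositives / probes;
   ```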
   
   #### removed from the default runs
   Running these tests as part of the default build process was turned off by adding the `@Disabled` annotation to the new test class.
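   
   For reference, the JUnit 5 pattern this refers to (a sketch; the actual file may differ in details):
   ```
   import org.junit.jupiter.api.Disabled;
   
   @Disabled  // excluded from the default build; remove locally to run
   public class TestSparkBloomFilter {
     // test methods ...
   }
   ```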
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   

