ishnagy commented on code in PR #50933:
URL: https://github.com/apache/spark/pull/50933#discussion_r2116047515


##########
common/sketch/src/test/java/org/apache/spark/util/sketch/TestSparkBloomFilter.java:
##########
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.junit.jupiter.api.*;
+import org.junitpioneer.jupiter.cartesian.CartesianTest;
+import org.junitpioneer.jupiter.cartesian.CartesianTest.Values;
+
+import java.io.PrintStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.Map;
+import java.util.Random;
+import java.util.concurrent.ConcurrentHashMap;
+
+
+@Disabled

Review Comment:
   > Submitting a test case and directly disabling it is not an ideal approach. 
Why can't it undergo regular validation through GitHub Actions?
   
   I agree. In spirit, the test code I submitted is much closer to a benchmark 
(measurement rather than validation) than to an actual test case with an 
emphasis on expectations and assertions.
   
   I wasn't aware of the benchmark workflow; I will have a look at whether I 
can fit this logic in there. I'm not sure it will be a snap-in fit, because the 
code focuses on obtaining a Bloom-filter-specific measure, the false positive 
rate, alongside more generic measures like running time and resource 
consumption.
   
   Moreover, the performance gains won't be directly apparent at the sketch 
level. If anything, the new logic will have a slightly worse running time, but 
it shouldn't consume more memory than the previous logic. The gains should 
only be measurable in client code (like SQL) that uses the sketch 
implementation: e.g. a reasonably low error rate in the implementation means 
that almost no queried element is forced onto the slow path when the filter is 
saturated.
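   The slow-path effect in client code can be sketched roughly as below. All names here are hypothetical illustrations; only the `mightContain`-style check mirrors the shape of the real `org.apache.spark.util.sketch.BloomFilter` API, and the predicate stands in for an actual filter:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongPredicate;

// Hypothetical client-side lookup that uses a Bloom-filter-style predicate
// to skip an expensive "slow path" (e.g. a disk or shuffle probe).
public class GatedLookup {
    private final LongPredicate mightContain; // stands in for BloomFilter#mightContainLong
    private final Map<Long, String> slowStore;
    long slowPathProbes = 0; // how often we paid the slow-path cost

    GatedLookup(LongPredicate mightContain, Map<Long, String> slowStore) {
        this.mightContain = mightContain;
        this.slowStore = slowStore;
    }

    String lookup(long key) {
        if (!mightContain.test(key)) {
            return null; // definite miss: fast path, slow store never touched
        }
        // Reached on every true hit AND on every false positive; a saturated
        // filter with a high false positive rate sends almost every key here,
        // which is where the degradation becomes visible to clients.
        slowPathProbes++;
        return slowStore.get(key);
    }
}
```

   This is why the win shows up in the client rather than in the sketch itself: every false positive is one wasted slow-path probe.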
   
   > Or can the scenarios in 
`org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark` reflect the 
optimizations brought about by the current pr?
   
   This may or may not be measurable with the current benchmarks; I haven't 
looked into that yet. As a rule of thumb, in the current implementation, after 
a few hundred million elements the false positive rate gets noticeably (a few 
percent) higher than expected; at around a billion (2^30) elements it diverges 
significantly (a few tens of percent); and above 2G (2^31) items it gets 
prohibitively high (90%+). With the proposed new logic, the error rate stays 
within a few percent of the expected value at all scales.
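   For reference, the measurement methodology behind numbers like these is simple: insert N items, then query N items that were never inserted and count the hits. The sketch below uses a deliberately toy, self-contained Bloom filter (standard `m = -n ln p / (ln 2)^2` and `k = (m/n) ln 2` sizing, double hashing) purely to illustrate the procedure; it is not the Spark implementation and the constants are arbitrary:

```java
import java.util.BitSet;

// Toy Bloom filter, a hypothetical stand-in used only to show how an
// observed false positive rate is measured against the configured one.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(long expectedItems, double fpp) {
        // Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        this.numBits = (int) Math.ceil(
            -expectedItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        this.numHashes = Math.max(1,
            (int) Math.round((double) numBits / expectedItems * Math.log(2)));
        this.bits = new BitSet(numBits);
    }

    // Double hashing: index_i(x) = h1(x) + i * h2(x) mod m, h2 forced odd.
    private int index(long item, int i) {
        long h1 = Long.hashCode(item * 0x9E3779B97F4A7C15L);
        long h2 = Long.hashCode(Long.rotateLeft(item, 31) * 0xC2B2AE3D27D4EB4FL) | 1L;
        return (int) Math.floorMod(h1 + (long) i * h2, (long) numBits);
    }

    void put(long item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    boolean mightContain(long item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}

public class FppMeasurement {
    public static void main(String[] args) {
        long n = 100_000;
        ToyBloomFilter filter = new ToyBloomFilter(n, 0.03);

        // Insert items 0..n-1.
        for (long i = 0; i < n; i++) filter.put(i);

        // Query n items that were never inserted; every hit is a false positive.
        long falsePositives = 0;
        for (long i = n; i < 2 * n; i++) {
            if (filter.mightContain(i)) falsePositives++;
        }
        System.out.println("observed fpp: " + (double) falsePositives / n);
    }
}
```

   The degradation described above is exactly this observed rate drifting away from the configured one as the inserted count grows.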
   
   If the current benchmarks already use Bloom filters with more than a few 
hundred million items inserted, then the performance improvements should be 
visible there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

