ishnagy commented on code in PR #50933: URL: https://github.com/apache/spark/pull/50933#discussion_r2116047515
########## common/sketch/src/test/java/org/apache/spark/util/sketch/TestSparkBloomFilter.java: ##########
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+import org.junit.jupiter.api.*;
+import org.junitpioneer.jupiter.cartesian.CartesianTest;
+import org.junitpioneer.jupiter.cartesian.CartesianTest.Values;
+
+import java.io.PrintStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.Map;
+import java.util.Random;
+import java.util.concurrent.ConcurrentHashMap;
+
+
+@Disabled

Review Comment:

> Submitting a test case and directly disabling it is not an ideal approach. Why can't it undergo regular validation through GitHub Actions?

I agree. In spirit, the test code I submitted is much closer to a benchmark (measurement rather than validation) than to an actual test case built around expectations and assertions. I wasn't aware of the benchmark workflow; I will have a look at whether I can fit this logic in there. I'm not sure it will be a snap-in fit, because the code focuses on obtaining a Bloom-filter-specific measure, the false positive rate, alongside more generic measures like running time and resource consumption.

Moreover, the performance gains won't be directly apparent at the sketch level. If anything, the new logic will have a slightly worse running time, but it shouldn't consume more memory than the previous one. The gains should only be measurable in client code (like SQL) that uses the sketch implementation: with a reasonably low error rate, almost no queried element gets forced onto the slow path, even when the filter is saturated.

> Or can the scenarios in `org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark` reflect the optimizations brought about by the current pr?

This may or may not be measurable with the current benchmarks; I haven't looked into that yet. As a rule of thumb, in the current implementation the false positive rate gets noticeably higher than expected (by a few percent) after a few hundred million inserted elements, diverges significantly (by tens of percent) at around a billion (2^30) items, and becomes prohibitively high (90%+) above 2G (2^31) items.

With the proposed new logic, the error rate stays within a few percent of the expected value at all scales. If the current benchmarks already insert more than a few hundred million items into the Bloom filters, the performance improvements should be visible there.
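
As an illustration (not part of this PR), here is a minimal, hypothetical sketch of how the observed false positive rate could be measured with the public `BloomFilter` API from `org.apache.spark.util.sketch`. The item counts, the target FPP, and the `BloomFppProbe` class name are made up for the example:

```java
import org.apache.spark.util.sketch.BloomFilter;

public class BloomFppProbe {
  public static void main(String[] args) {
    // Illustrative sizes only; a real measurement would push numItems toward 2^30..2^31.
    long numItems = 100_000_000L;   // items inserted into the filter
    long numProbes = 1_000_000L;    // non-inserted items queried afterwards
    double expectedFpp = 0.03;      // requested false positive probability

    BloomFilter filter = BloomFilter.create(numItems, expectedFpp);

    // Insert the "member" keys 0 .. numItems-1.
    for (long i = 0; i < numItems; i++) {
      filter.putLong(i);
    }

    // Query keys that were never inserted; every positive answer is a false positive.
    long falsePositives = 0;
    for (long i = numItems; i < numItems + numProbes; i++) {
      if (filter.mightContainLong(i)) {
        falsePositives++;
      }
    }

    double observedFpp = (double) falsePositives / numProbes;
    System.out.printf("expected FPP: %.4f, observed FPP: %.4f%n", expectedFpp, observedFpp);
  }
}
```

At the scales discussed above, the gap between `expectedFpp` and `observedFpp` is the measure that distinguishes the current implementation from the proposed one.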