Hi Josh, all

I'm sitting at an airport, so rather than participating in the comment threads in the doc, I will just post some high-level principles I've derived during my own long career in performance testing.
Infra:
- It's a common myth that you need to use on-premise HW because cloud HW is noisy.
- Most likely the opposite is true: a small cluster of lab hardware runs the risk of some sysadmin with root access manually modifying the servers and leaving them in an inconsistent configuration. On the other hand, a public cloud is configured with infrastructure as code, so every change is tracked in version control.
- Four-part article on how we tuned EC2 at my previous employer:
  1. <https://www.mongodb.com/blog/post/reducing-variability-performance-tests-ec2-setup-key-results>
  2. <https://www.mongodb.com/blog/post/repeatable-performance-tests-ec2-instances-neither-good-nor-bad>
  3. <https://www.mongodb.com/blog/post/repeadtable-performance-tests-ebs-instances-stable-option>
  4. <https://www.mongodb.com/blog/post/repeatable-performance-tests-cpu-options-best-disabled>
- Trust no one, measure everything. For example, don't trust that what I'm writing here is true. Run sysbench against your HW, then you have first-hand observations.
- Using EC2 specifically has the additional benefit that the instance types can be considered well-known, standard HW configurations, more so than any on-premise system.

Performance testing is regression testing:
- Important: run perf tests with the nightly build. Make sure your HW configuration is repeatable, with low variability from day to day.
- Less important / for later:
  - Using complicated benchmarks (TPC-C...) that try to model a real-world app. These can take weeks to develop, each.
  - Having lots of different benchmarks for "coverage".
- Adding the above two together: running a simple key-value test (e.g. YCSB) every night in an automated and repeatable way, and storing the result - whatever is considered relevant - so that you end up with a timeseries, is a great start, and I'd take it over a complicated "representative" benchmark any day. (See the first sketch after these lists.)
- Use change detection to automatically and deterministically flag statistically significant change points (regressions). (See the second sketch after these lists.)
- Literature: Detecting performance regressions with DataStax Hunter <https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4>
- Literature: Fallout: Distributed Systems Testing as a Service <https://www.semanticscholar.org/paper/0cebbfebeab6513e98ad1646cc795cabd5ddad8a> and Automated system performance testing at MongoDB <https://www.connectedpapers.com/main/0cebbfebeab6513e98ad1646cc795cabd5ddad8a/graph>

Common gotchas:
- Testing with a small data set that fits entirely in RAM. A good dataset is 5x the RAM available to the DB process. Or you just test with the size a real production server would be running - at DataStax we have tests that use a 1 TB and a 1.5 TB data set, because those tend to be the standard maximum sizes (per node) at customers.
- The test runtime is too short. What counts as a good test duration depends on the database; the goal is to reach a stable state, which can be hard for an LSM database like Cassandra. For other databases I've worked with, the default is typically to flush every 15 to 60 seconds, and the test duration should be a multiple of that (3 to 5 min).
- Naive comparisons to determine whether a test result is a regression or not. For example, benchmarking the new release against the stable version, one run each, and reporting the result as "fact". Or comparing today's result with yesterday's.
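To make the nightly-timeseries idea concrete, here is a minimal sketch in Python. It's my own illustration, not anything from the doc or from existing tooling: the file name, the chosen metrics and the parsing of YCSB's summary output are all assumptions. The point is just the shape of the data you accumulate - one row per nightly run, keyed by date and git revision, holding whatever metrics you consider relevant.

```python
import csv
import datetime
import os
import re

# Hypothetical location of the accumulated timeseries.
RESULTS_FILE = "nightly_results.csv"
FIELDS = ["date", "git_sha", "throughput_ops", "p99_read_us", "p99_update_us"]


def parse_ycsb_summary(output: str) -> dict:
    """Pull a few headline numbers out of a YCSB summary, which has lines like:
       [OVERALL], Throughput(ops/sec), 51234.5
       [READ], 99thPercentileLatency(us), 1890
    """
    patterns = {
        "throughput_ops": r"\[OVERALL\], Throughput\(ops/sec\), ([\d.]+)",
        "p99_read_us": r"\[READ\], 99thPercentileLatency\(us\), ([\d.]+)",
        "p99_update_us": r"\[UPDATE\], 99thPercentileLatency\(us\), ([\d.]+)",
    }
    metrics = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, output)
        if m:
            metrics[name] = float(m.group(1))
    return metrics


def append_result(git_sha: str, metrics: dict) -> None:
    """Append one data point so the CSV grows into a timeseries, one row per night."""
    write_header = not os.path.exists(RESULTS_FILE)
    with open(RESULTS_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"date": datetime.date.today().isoformat(),
                         "git_sha": git_sha,
                         **metrics})
```

A nightly job would run the YCSB workload against the freshly built binaries, feed its stdout to parse_ycsb_summary(), and call append_result() with that build's git SHA.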
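And a minimal sketch of the change-detection step over that timeseries. To be clear about what is an assumption here: the tools referenced above (Hunter, and the MongoDB/Fallout work) use E-Divisive means; the Welch's t-test over two sliding windows below is only a simplified stand-in to show the principle - the decision "this is a regression" is made deterministically from the whole history, never by eyeballing two individual runs.

```python
import csv
from statistics import mean

from scipy import stats  # assumes SciPy is available for Welch's t-test

WINDOW = 7      # nightly results on each side of a candidate change point
ALPHA = 0.001   # conservative threshold: only flag clearly significant shifts


def load_series(path: str, metric: str) -> list:
    """Read the timeseries written by the previous sketch."""
    with open(path, newline="") as f:
        return [(row["git_sha"], float(row[metric])) for row in csv.DictReader(f)]


def change_points(series: list) -> list:
    """Return (git_sha, mean_before, mean_after) for every flagged shift."""
    flagged = []
    values = [v for _, v in series]
    for i in range(WINDOW, len(values) - WINDOW):
        before = values[i - WINDOW:i]
        after = values[i:i + WINDOW]
        # Welch's t-test: are the means of the two windows significantly different?
        _, p_value = stats.ttest_ind(before, after, equal_var=False)
        if p_value < ALPHA:
            flagged.append((series[i][0], mean(before), mean(after)))
    return flagged


if __name__ == "__main__":
    series = load_series("nightly_results.csv", "throughput_ops")
    for sha, before, after in change_points(series):
        print(f"possible change point at {sha}: {before:.0f} -> {after:.0f} ops/sec")
```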
Building perf testing systems following the above principles has had a lot of positive impact in my projects. For example, at my previous employer we caught 17 significant regressions during the 1-year-long development cycle of the next major version (see my paper above). OTOH, after the GA release, during the next year users reported only 1 significant performance regression. That is to say, the perf testing of nightly builds caught all but one regression in the new major version.

henrik

On Fri, Dec 30, 2022 at 7:41 AM Josh McKenzie <jmcken...@apache.org> wrote:

> There was a really interesting presentation from the Lucene folks at
> ApacheCon about how they're doing perf regression testing. That combined
> with some recent contributors wanting to get involved on some performance
> work and not having much direction or clarity on how to get involved led
> some of us to come together and riff on what we might be able to take away
> from that presentation and context.
>
> Lucene presentation: "Learning from 11+ years of Apache Lucene
> benchmarks":
> https://docs.google.com/presentation/d/1Tix2g7W5YoSFK8jRNULxOtqGQTdwQH3dpuBf4Kp4ouY/edit#slide=id.p
>
> Their nightly indexing benchmark site:
> https://home.apache.org/~mikemccand/lucenebench/indexing.html
>
> I've checked in with a handful of performance minded contributors in early
> December and we came up with a first draft, then some others of us met on
> an adhoc call on the 12/9 (which was recorded; ping on this thread if you'd
> like that linked - I believe Joey Lynch has that).
>
> Here's where we landed after the discussions earlier this month (1st page,
> estimated reading time 5 minutes):
> https://docs.google.com/document/d/1X5C0dQdl6-oGRr9mXVPwAJTPjkS8lyt2Iz3hWTI4yIk/edit#
>
> Curious to hear what other perspectives there are out there on the topic.
>
> Early Happy New Years everyone!
>
> ~Josh
>

--
Henrik Ingo
+358 40 569 7354
<https://www.datastax.com/> <https://twitter.com/DataStaxEng> <https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg> <https://www.linkedin.com/in/heingo/>