Absolutely, happy to share. All tests were done using easy-cass-stress v9 and easy-cass-lab, with the latest released 5.0 (so not including CASSANDRA-15452 or CASSANDRA-20092). Instructions are at the end.
> Regarding allocation rate vs throughput, unfortunately allocation rate vs throughput are not connected linearly,

Yes, agreed, they're not linearly related. However, allocation rate does correlate linearly with GC pause frequency, and it does increase GC pause time. When you increase your write throughput, you put more pressure on compaction. In order to keep up, you need to increase compaction throughput. This leads to excess allocation and, in turn, longer pauses. For teams with a tight SLO (say, 10ms p99), compaction allocation becomes one of the factors that prevent them from increasing node density, due to its effect on GC pause times. Reducing the allocation rate will allow for much faster compaction with less impact on GC.

> So, while I agree that the mentioned compaction logic (cells deserializing) is a subject to improve from an allocation point of view I am not sure if we get dramatic improvements in throughput just because of reducing it..

I am _quite_ confident that if we reduce the total allocation in Cassandra by almost 50% we will see a _significant_ performance improvement, but obviously we need hard numbers, not just my gut feelings and unbridled confidence.

I'll have to dig up the profile; I'm switching between a bunch of tests and sadly I didn't label all of them, and I've collected quite a few. The % number I referenced earlier in the thread was from a different load test that I looked up several days ago, and I have several hundred of these profiles hanging around.

Here's the process of setting up the cluster with easy-cass-lab (I have ecl aliased to easy-cass-lab on my laptop):

mkdir test
cd test
ecl init -i r5d.2xlarge -c 3 -s 1 test
ecl up
ecl use 5.0

cat <<'EOF' >> cassandra.patch.yaml
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
    default:
      inherits: trie
memtable_offheap_space: 8GiB
memtable_allocation_type: offheap_objects
EOF

Then apply these JVM settings to the jvm.options file in the local dir:

### G1 Settings
## Use the Hotspot garbage-first collector.
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:MaxTenuringThreshold=2
-XX:G1HeapRegionSize=16m
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=50
-Xms30G
-Xmx30G
#
## Have the JVM do less remembered set work during STW, instead
## preferring concurrent GC. Reduces p99.9 latency.
-XX:G1RSetUpdatingPauseTimePercent=5
#
## Main G1GC tunable: lowering the pause target will lower throughput and vice versa.
## 200ms is the JVM default and lowest viable setting.
## 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
-XX:MaxGCPauseMillis=200

Then have it update the configs and start the cluster:

ecl uc
ecl start
source env.sh

You can disable compaction on one node:

c0 nodetool disableautocompaction

Connect to the stress instance using the shortcut defined in env.sh:

s0

Running the stress workload is best done with Shenandoah and Java 17, to avoid long pauses in the stress tool itself:

sudo update-java-alternatives -s java-1.17.0-openjdk-amd64
export EASY_CASS_STRESS_OPTS="-XX:+UseShenandoahGC"

Here's a write-only workload with very small values:

easy-cass-stress run KeyValue -d 1h --field.keyvalue.value='random(4,8)' --maxwlat 50 --rate 200k -r 0

Let that ramp up for a bit. Then, back in your local dir (make sure you've sourced env.sh first):

cflame cassandra0

It'll take a CPU profile, running for about a minute. You can also get an allocation profile by doing this:

cflame cassandra0 -e alloc

Feel free to ping me directly with questions.
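If you want to sanity-check the GC-pressure side of this while the workload runs, here's a rough sketch using stock nodetool through the same env.sh shortcuts (the 5 GiB/s allocation rate in the comment below is just an illustrative number, not a measurement from this test):

# Back-of-the-envelope: -Xmx30G with G1NewSizePercent=50 gives a young gen of roughly
# 15 GiB, so an allocation rate around 5 GiB/s means a young collection roughly every
# 3 seconds; double the allocation rate and the pause frequency roughly doubles.

# GC counts and pause totals since the last invocation (the counters reset on each call):
c0 nodetool gcstats

# Watch the compaction backlog while you change settings:
c0 nodetool compactionstats

# Re-enable compaction on that node, and optionally unthrottle it (0 = no limit)
# to see the effect of compaction allocation on pause times:
c0 nodetool enableautocompaction
c0 nodetool setcompactionthroughput 0

Comparing gcstats output with compaction disabled vs. unthrottled is a quick way to see how much of the GC load in this workload comes from the compaction path.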
Jon

On Tue, Mar 11, 2025 at 3:20 PM Dmitry Konstantinov <netud...@gmail.com> wrote:

> Jon, thank you for testing! Can you share your CPU profile and test load details? Have you tested it with CASSANDRA-20092 changes included?
>
> >> Allocations related to codahale were < 1%.
> Just to clarify: in the initial mail by memory footprint I mean the static amount of memory used to store metric objects, not a dynamic allocation during requests processing (it should be almost zero and not a target to optimize).
>
> >> Once compaction is enabled, it's in the 2-3% realm
> What percent of CPU profile do you have spent for compaction in your load? (to dilute 7-8% to 2-3% it should be around 50%.., because compaction does not change the ratio between total efforts spent for request processing vs the metrics part of it)
>
> Regarding allocation rate vs throughput, unfortunately allocation rate vs throughput are not connected linearly, for example here: https://issues.apache.org/jira/browse/CASSANDRA-20165 I reduced allocation almost 2 times and got about 8% improvement in throughput (which is still a good result).
> So, while I agree that the mentioned compaction logic (cells deserializing) is a subject to improve from an allocation point of view I am not sure if we get dramatic improvements in throughput just because of reducing it..
>
> Regarding the metric registry - yes, I do not see a reason to move away from it; in any case we need a common place to access metrics, to provide the correspondent virtual tables at least.
> Regarding docs - I like this. I actually did something similar in one of my non-open source projects by adding a description to each metric to be able to render docs + to validate during a build that added metrics are properly documented.
>
> On Tue, 11 Mar 2025 at 17:56, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
>> Definitely +1 on registry + docs. I believe that's part of the OTel Java SDK [1][2]
>>
>> I did some performance testing yesterday and was able to replicate the findings where the codahale code path took 7-10% of CPU time. The only caveat is that it only happens with compaction disabled. Once compaction is enabled, it's in the 2-3% realm. Allocations related to codahale were < 1%.
>>
>> I'm not discouraging anyone from pursuing performance optimizations, just looking to set expectations on what the real world benefits will be. This will likely yield a ~ 2% improvement in throughput based on the earlier discussion.
>>
>> For comparison, eliminating a single byte buffer allocation in ByteArrayAccessor.read in the BTree code path would reduce heap allocations by 40%, with default compaction throughput of 64MB/s. Addressing this, in conjunction with the recently merged CASSANDRA-15452 + Branimir's CASSANDRA-20092 patch, would allow for much faster compaction which, in turn, would improve density and significantly reduce latency. If you're chasing perf issues, this is one of the top problems in the codebase.
>>
>> <steps back from the bait>
>>
>> Jon
>>
>> [1] https://opentelemetry.io/docs/languages/java/api/#meterprovider
>> [2] https://opentelemetry.io/docs/specs/semconv/attributes-registry/
>>
>> On Tue, Mar 11, 2025 at 8:02 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>
>>> Having something like a registry and standardizing/enforcing all metric types is something we should be sure to maintain.
>>>
>>> A registry w/documentation on each metric indicating *what it's actually measuring and what it means* would be great for our users.
>>>
>>> On Mon, Mar 10, 2025, at 3:46 PM, Chris Lohfink wrote:
>>>
>>> Just something to be mindful about what we had *before* codahale in Cassandra and avoid that again. Pre 1.1 it was pretty much impossible to collect metrics without looking at code (there were efficient custom made things, but each metric was reported differently) and that stuck through until 2.2 days. Having something like a registry and standardizing/enforcing all metric types is something we should be sure to maintain.
>>>
>>> Chris
>>>
>>> On Fri, Mar 7, 2025 at 1:33 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> As long as operators are able to use all the OTel tooling, I'm happy. I'm not looking to try to decide what the metrics API looks like, although I think trying to plan for 15 years out is a bit unnecessary. A lot of the DB will be replaced by then. That said, I'm mostly hands off on code and you guys are more than capable of making the smart decision here.
>>>
>>> Regarding virtual tables, I'm looking at writing a custom OTel receiver [1] to ingest them. I was really impressed with the performance work you did there and it got my wheels turning on how to best make use of it. I am planning on using it with easy-cass-lab to pull DB metrics and logs down to my local machine along with kernel metrics via eBPF.
>>>
>>> Jon
>>>
>>> [1] https://opentelemetry.io/docs/collector/building/receiver/
>>>
>>> On Wed, Mar 5, 2025 at 1:06 PM Maxim Muzafarov <mmu...@apache.org> wrote:
>>>
>>> If we do swap, we may run into the same issues with third-party metrics libraries in the next 10-15 years that we are discussing now with the Codahale we added ~10-15 years ago, and given the fact that a proposed new API is quite small, my personal feeling is that it would be our best choice for the metrics.
>>>
>>> Having our own API also doesn't prevent us from having all the integrations with new 3rd-party libraries the world will develop in future, just by writing custom adapters to our own -- this will be possible for the Codahale (with some suboptimal considerations), where we have to support backwards compatibility, and for the OpenTelemetry as well. We already have the CEP-32 [1] proposal to instrument metrics; in this sense, it doesn't change much for us.
>>>
>>> Another point of having our own API is the virtual tables we have -- it gives us enough flexibility and latitude to export the metrics efficiently via the virtual tables by implementing the access patterns we consider important.
>>>
>>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071749#CEP32:(DRAFT)OpenTelemetryintegration-ExportingMetricsthroughOpenTelemetry
>>> [2] https://opentelemetry.io/docs/languages/java/instrumentation/
>>>
>>> On Wed, 5 Mar 2025 at 21:35, Jeff Jirsa <jji...@gmail.com> wrote:
>>> >
>>> > I think it's widely accepted that OTel in general has won this stage of observability, as most metrics systems allow it and most SaaS providers support it. So Jon's point there is important.
>>> >
>>> > The promise of unifying logs/traces/metrics (aka wide events) is usually far more important on the tracing side of our observability than in the areas where we use Codahale/DropWizard.
>>> >
>>> > Scott: if we swap, we can (probably should) deprecate like everything else, and run both side by side for a release so people don't lose metrics entirely on bounce? FF both, to control double cost during the transition.
>>> >
>>> > On Mar 5, 2025, at 8:21 PM, C. Scott Andreas <sc...@paradoxica.net> wrote:
>>> >
>>> > No strong opinion on particular choice of metrics library.
>>> >
>>> > My primary feedback is that if we swap metrics implementations and the new values are *different*, we can anticipate broad user confusion/interest.
>>> >
>>> > In particular, if latency stats are reported higher post-upgrade, we should expect users to interpret this as a performance regression, dedicating significant resources to investigating the change, and expending credibility with stakeholders in their systems.
>>> >
>>> > - Scott
>>> >
>>> > On Mar 5, 2025, at 11:57 AM, Benedict <bened...@apache.org> wrote:
>>> >
>>> > I really like the idea of integrating tracing, metrics and logging frameworks.
>>> >
>>> > I would like to have the time to look closely at the API before we decide to adopt it though. I agree that a widely deployed API has inherent benefits, but any API we adopt also shapes future evolution of our capabilities. Hopefully this is also a good API that allows us plenty of evolutionary headroom.
>>> >
>>> > On 5 Mar 2025, at 19:45, Josh McKenzie <jmcken...@apache.org> wrote:
>>> >
>>> > if the plan is to rip out something old and unmaintained and replace with something new, I think there's a huge win to be had by implementing the standard that everyone's using now.
>>> >
>>> > Strong +1 on anything that's an ecosystem integration inflection point. The added benefit here is that if we architect ourselves to gracefully integrate with whatever systems are ubiquitous today, we'll inherit the migration work that any new industry-wide replacement system would need to do to become the new de facto standard.
>>> >
>>> > On Wed, Mar 5, 2025, at 2:23 PM, Jon Haddad wrote:
>>> >
>>> > Thank you for the replies.
>>> >
>>> > Dmitry: Based on some other patches you've worked on and your explanation here, it looks like you're optimizing the front door portion of the write path - very cool. Testing it in isolation with those settings makes sense if your goal is to push write throughput as far as you can, something I'm very much on board with, and is a key component to pushing density and reducing cost. I'm spinning up a 5.0 cluster now to run a test, so I'll run a load test similar to what you've done and try to reproduce your results. I'll also review the JIRA to get more familiar with what you're working on.
>>> >
>>> > Benedict: I agree with your line of thinking around optimizing the cost of metrics. As we push both density and multi-tenancy, there's going to be more and more demand for clusters with hundreds or thousands of tables. Maybe tens of thousands. Reducing overhead for something that's O(N * M) (multiple counters per table) will definitely be a welcome improvement. There's always more stuff that's going to get in the way, but it's an elephant and I appreciate every bite.
>>> >
>>> > My main concern with metrics isn't really compatibility, and I don't have any real investment in DropWizard.
>>> > I don't know if there's any real value in putting in effort to maintain compatibility, but I'm just one sample, so I won't make a strong statement here.
>>> >
>>> > It would be *very nice* if we moved to metrics which implement the Open Telemetry Metrics API [1], which I think solves multiple issues at once:
>>> >
>>> > * We can use either one of the existing implementations (OTel SDK) or our own
>>> > * We get a "free" upgrade that lets people tap into the OTel ecosystem
>>> > * It paves the way for OTel traces with ZipKin [2] / Jaeger [3]
>>> > * We can use the ubiquitous OTel instrumentation agent to send metrics to the OTel collector, meaning people can collect at a much higher frequency than today
>>> > * OTel logging is a significant improvement over logback; you can correlate metrics + traces + logs together.
>>> >
>>> > Anyways, if the plan is to rip out something old and unmaintained and replace with something new, I think there's a huge win to be had by implementing the standard that everyone's using now.
>>> >
>>> > All this is very exciting and I appreciate the discussion!
>>> >
>>> > Jon
>>> >
>>> > [1] https://opentelemetry.io/docs/languages/java/api/
>>> > [2] https://zipkin.io/
>>> > [3] https://www.jaegertracing.io/
>>> >
>>> > On Wed, Mar 5, 2025 at 2:58 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>> >
>>> > Hi Jon
>>> >
>>> > >> Is there a specific workload you're running where you're seeing it take up a significant % of CPU time? Could you share some metrics, profile data, or a workload so I can try to reproduce your findings?
>>> > Yes, I have shared the workload generation command (sorry, it is in cassandra-stress, I have not yet adopted your tool but want to do it soon :-) ), setup details and async-profiler CPU profile in CASSANDRA-20250.
>>> > A summary:
>>> >
>>> > - it is a plain insert-only workload to assert a max throughput capacity for a single node: ./tools/bin/cassandra-stress "write n=10m" -rate threads=100 -node myhost
>>> > - small amount of data per row is inserted, local SSD disks are used, so CPU is a primary bottleneck in this scenario (while it is quite synthetic, in my real business cases CPU is a primary bottleneck as well)
>>> > - I used the 5.1 trunk version (I have similar results for 5.0 from when I was checking CASSANDRA-20165)
>>> > - I enabled trie memtables + offheap objects mode
>>> > - I disabled compaction
>>> > - a recent nightly build is used for async-profiler
>>> > - my hardware is quite old: on-premise VM, Linux 4.18.0-240.el8.x86_64, OpenJdk-11.0.26+4, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 16 cores
>>> > - link to CPU profile ("codahale" code: 8.65%)
>>> > - -XX:+DebugNonSafepoints option is enabled to improve the profile precision
>>> >
>>> > On Wed, 5 Mar 2025 at 12:38, Benedict Elliott Smith <bened...@apache.org> wrote:
>>> >
>>> > Some quick thoughts of my own…
>>> >
>>> > === Performance ===
>>> > - I have seen heap dumps with > 1GiB dedicated to metric counters. This patch should improve this, while opening up room to cut it further, steeply.
>>> > - The performance improvement in relative terms for the metrics being replaced is rather dramatic - about 80%. We can also improve this further.
>>> > - Cheaper metrics (in terms of both cpu and memory) means we can readily have more of them, exposing finer-grained details. This is hard to understate the value of.
>>> >
>>> > === Reporting ===
>>> > - We're already non-standard for our most important metrics, because we had to replace the Codahale histogram years ago
>>> > - We can continue implementing the Codahale interfaces, so that exporting libraries have minimal work to support us
>>> > - We can probably push patches upstream to a couple of selected libraries we consider important
>>> > - I would anyway also support picking a new reporting framework to support, but I would like us to do this with great care to avoid repeating our mistakes. I won't have cycles to actually implement this, so it would be down to others to decide if they are willing to undertake this work
>>> >
>>> > I think the fallback option for now, however, is to abuse unsafe to allow us to override the implementation details of Codahale metrics. So we can decouple the performance discussion for now from the deprecation discussion, but I think we should have a target of deprecating Codahale/DropWizard for the reasons Dmitry outlines, however we decide to do it.
>>> >
>>> > On 4 Mar 2025, at 21:17, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> >
>>> > I've got a few thoughts...
>>> >
>>> > On the performance side, I took a look at a few CPU profiles from past benchmarks and I'm seeing DropWizard taking ~ 3% of CPU time. Is there a specific workload you're running where you're seeing it take up a significant % of CPU time? Could you share some metrics, profile data, or a workload so I can try to reproduce your findings? In my testing I've found the majority of the overhead from metrics to come from JMX, not DropWizard.
>>> >
>>> > On the operator side, inventing our own metrics lib risks making it harder to instrument Cassandra. There are libraries out there that allow you to tap into DropWizard metrics directly. For example, Sarma Pydipally did a presentation on this last year [1] based on some code I threw together.
>>> >
>>> > If you're planning on making it easier to instrument C* by supporting sending metrics to the OTel collector [2], then I could see the change being a net win as long as the perf is no worse than the status quo.
>>> >
>>> > It's hard to know the full extent of what you're planning and the impact, so I'll save any opinions till I know more about the plan.
>>> >
>>> > Thanks for bringing this up!
>>> > Jon
>>> >
>>> > [1] https://planetcassandra.org/leaf/apache-cassandra-lunch-62-grafana-dashboard-for-apache-cassandra-business-platform-team/
>>> > [2] https://opentelemetry.io/docs/collector/
>>> >
>>> > On Tue, Mar 4, 2025 at 12:40 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > After a long conversation with Benedict and Maxim in CASSANDRA-20250 I would like to raise and discuss a proposal to deprecate Dropwizard/Codahale metrics usage in the next major release of Cassandra server and drop it in the following major release.
>>> > Instead of it, our own Java API and implementation should be introduced. For the next major release the Dropwizard/Codahale API is still planned to be supported by extending Codahale implementations, to give potential users of this API enough time for transition.
>>> > The proposal does not affect the JMX API for metrics; it is only about local Java API changes within the Cassandra server classpath, so it only matters in cases where somebody outside of the Cassandra server code relies on the Codahale API in some kind of extensions or agents.
>>> >
>>> > Reasons:
>>> > 1) The Codahale metrics implementation is not very efficient from a CPU and memory usage point of view. In the past we already replaced the default Codahale implementation for Reservoir with our custom one, and now in CASSANDRA-20250 we (Benedict and I) want to add a more efficient implementation for Counter and Meter logic. So, in total we do not have so much logic left from the original library (mostly a MetricRegistry as a container for metrics) and the majority of logic is implemented by ourselves.
>>> > We use metrics a lot along the read and write paths and they contribute a visible overhead (for example, for a plain write load it is about 9-11% according to an async-profiler CPU profile), so we want them to be highly optimized.
>>> > From a memory perspective, Counter and Meter are built based on LongAdder and they are quite heavy for the amounts which we create and use.
>>> >
>>> > 2) Codahale metrics does not provide any way to replace Counter and Meter implementations. There are no fully functional interfaces for these entities + MetricRegistry has casts/checks to implementations and cannot work with anything else.
>>> > I looked through the already reported issues and found the following similar and unsuccessful attempt to introduce interfaces for metrics: https://github.com/dropwizard/metrics/issues/2186
>>> > as well as other older attempts:
>>> > https://github.com/dropwizard/metrics/issues/252
>>> > https://github.com/dropwizard/metrics/issues/264
>>> > https://github.com/dropwizard/metrics/issues/703
>>> > https://github.com/dropwizard/metrics/pull/487
>>> > https://github.com/dropwizard/metrics/issues/479
>>> > https://github.com/dropwizard/metrics/issues/253
>>> >
>>> > So, the option of requesting extensibility from Codahale metrics does not look realistic.
>>> >
>>> > 3) It looks like the library is in maintenance mode now, the 5.x version is on hold and many integrations are also not so alive.
>>> > The main benefit of using Codahale metrics should be a huge amount of reporters/integrations, but if we check carefully the list of reporters mentioned here: https://metrics.dropwizard.io/4.2.0/manual/third-party.html#reporters
>>> > we can see that almost all of them are dead/archived.
>>> >
>>> > 4) In general, exposing other 3rd-party libraries as our own public API frequently creates too many limitations and issues (Guava is another typical example which I saw previously; it is easy to start with but later you struggle more and more).
>>> >
>>> > Does anyone have any questions or concerns regarding this suggestion?
>>> > --
>>> > Dmitry Konstantinov
>>> >
>>> > --
>>> > Dmitry Konstantinov
>
> --
> Dmitry Konstantinov