TLP tools for stress testing and building test clusters in AWS

Jon Haddad Fri, 12 Apr 2019 08:35:32 -0700

I don't want to derail the discussion about Stabilizing Internode
Messaging, so I'm starting this as a separate thread.  There was a
comment that Josh made [1] about doing performance testing with real
clusters as well as a lot of microbenchmarks, and I'm 100% in support
of this.  We've been working on some tooling at TLP for the last
several months to make this a lot easier.  One of the goals has been
to help improve the 4.0 testing process.

The first tool we have is tlp-stress [2]. It's designed with a "get
started in 5 minutes" mindset. My goal was to build a stress tool that
ships with real workloads out of the box that can be easily tweaked,
similar to how fio allows you to design a disk workload and tweak it
with parameters. Included are workloads that stress LWTs (two
different types), materialized views, counters, time series, and
key-value access. Each workload can be modified easily to change
compaction strategies, concurrent operations, and the number of partitions.
We can run workloads for a set number of iterations or a custom
duration. We've used this *extensively* at TLP to help our customers
and most of our blog posts that discuss performance use it as well.
It exports data in CSV format and automatically sets up Prometheus for
metrics collection and aggregation. As an example, we were able to
determine that the compression length set on the paxos tables imposes
a significant overhead when using the Locking LWT workload, which
simulates locking and unlocking of rows. See CASSANDRA-15080 for
details.
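To illustrate why a locking workload leans so heavily on the paxos tables, here is a minimal in-memory Python sketch of a lock/unlock pattern built on compare-and-set. The `locks` table, its `resource`/`owner` columns and the `LwtLockTable` class are hypothetical; this is not tlp-stress's actual schema or code. It only models the semantics each LWT provides: every acquire and every release is a conditional write that either applies atomically or reports that it did not, and in Cassandra each one runs a Paxos round that reads and writes Paxos state for that partition.

```python
import threading
from typing import Dict, Optional


class LwtLockTable:
    """Hypothetical in-memory stand-in for a lock table driven by LWTs.

    acquire() mirrors  INSERT INTO locks (resource, owner) VALUES (?, ?) IF NOT EXISTS
    release() mirrors  DELETE FROM locks WHERE resource = ? IF owner = ?
    """

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        # Stands in for Paxos serializing conditional writes on one partition.
        self._mutex = threading.Lock()

    def acquire(self, resource: str, owner: str) -> bool:
        # Applies only if nobody currently holds the row.
        with self._mutex:
            if resource in self._rows:
                return False
            self._rows[resource] = owner
            return True

    def release(self, resource: str, owner: str) -> bool:
        # Applies only if the caller is the current holder.
        with self._mutex:
            if self._rows.get(resource) != owner:
                return False
            del self._rows[resource]
            return True

    def holder(self, resource: str) -> Optional[str]:
        with self._mutex:
            return self._rows.get(resource)
```

A full lock/unlock cycle is therefore two LWTs against the same partition, so any per-LWT overhead on the paxos tables is paid twice per cycle.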

We have documentation [3] on the TLP website.

The second tool we've been working on is tlp-cluster [4]. This tool
is designed to help provision AWS instances for the purposes of
testing. To be clear, I don't expect, or want, this tool to be used
for production environments. It's designed to assist with the
Cassandra build process by generating deb packages or re-using the
ones that have already been uploaded. Here's a short list of the
things you'll care about:

1. Create instances in AWS for Cassandra using any instance size and
number of nodes. Also create tlp-stress instances and a box for
monitoring.
2. Use any available build of Cassandra, with a quick option to change
the YAML config. For example: tlp-cluster use 3.11.4 -c
concurrent_writes:256
3. Do custom builds just by pointing to a local Cassandra git repo.
They can be used the same way as #2.
4. tlp-stress is automatically installed on the stress box.
5. Everything's installed with pure bash. I considered something more
complex, but since this is for development only, it turns out the
simplest tool possible works well and is easily configurable. Just
drop in your own bash script named in the XX_script_name.sh format,
starting with a number, and it gets run.
6. The monitoring box is running Prometheus. It automatically scrapes
Cassandra using the Instaclustr metrics library.
7. Grafana is also installed automatically. There are a couple of sample
graphs there now. We plan on having better default graphs soon.
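Items 2 and 5 describe two small conventions: key:value overrides passed with -c for the Cassandra YAML, and numbered XX_script_name.sh scripts that get picked up and run. The Python sketch below shows one way to model both. The function names are hypothetical, treating cassandra.yaml as a flat dict is a simplification (nested settings are not handled), and sorting by numeric prefix is an assumption rather than documented tlp-cluster behavior. With zero-padded two-digit prefixes, as the XX suggests, it gives the same order as a plain lexical sort.

```python
import re
from typing import Dict, List, Union

Value = Union[int, str]


def _coerce(value: str) -> Value:
    # YAML would read "256" as an integer; keep anything else as a string.
    try:
        return int(value)
    except ValueError:
        return value


def parse_overrides(pairs: List[str]) -> Dict[str, Value]:
    """Turn ["concurrent_writes:256"] into {"concurrent_writes": 256}."""
    overrides: Dict[str, Value] = {}
    for pair in pairs:
        key, sep, value = pair.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {pair!r}")
        overrides[key.strip()] = _coerce(value.strip())
    return overrides


def apply_overrides(config: Dict[str, Value], overrides: Dict[str, Value]) -> Dict[str, Value]:
    """Return a copy of a flat cassandra.yaml-style dict with overrides applied."""
    merged = dict(config)
    merged.update(overrides)
    return merged


# Matches names like 01_install_java.sh and captures the numeric prefix.
SCRIPT_NAME = re.compile(r"^(\d+)_[\w.-]+\.sh$")


def ordered_scripts(filenames: List[str]) -> List[str]:
    """Keep only XX_script_name.sh files, ordered by their numeric prefix."""
    matched = []
    for name in filenames:
        m = SCRIPT_NAME.match(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]
```

Keeping the logic this small fits the "simplest tool possible" approach from item 5: adding a step means dropping in a new numbered file, not editing the tool.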

For the moment it installs Java 8 only, but it should be easy to
switch to Java 11 to test ZGC (it's on my radar).

Documentation for tlp-cluster is here [5].

There are still some things to work out in the tool, and we've been
working hard to smooth out the rough edges. I still haven't announced
anything WRT tlp-cluster on the TLP blog, because I don't think it's
quite ready for public consumption, but I think the folks on this list
are smart enough to see the value in it even if it has a few warts
still.

I don't consider myself familiar enough with the networking patch to
give it a full review, but I am qualified to build tools to help test
it and go through the testing process myself. From what I can tell
the patch is moving the codebase in a positive direction and I'd like
to help build confidence in it so we can get it merged in.

We'll continue to build out and improve the tooling with the goal of
making it easier for people to jump into the QA side of things.

Jon

[1] https://lists.apache.org/thread.html/742009c8a77999f4b62062509f087b670275f827d0c1895bf839eece@%3Cdev.cassandra.apache.org%3E
[2] https://github.com/thelastpickle/tlp-stress
[3] http://thelastpickle.com/tlp-stress/
[4] https://github.com/thelastpickle/tlp-cluster
[5] http://thelastpickle.com/tlp-cluster/
