Hi Reid, Many thanks for this thoughtful response - very helpful and much appreciated.
No doubt some additional experimentation will pay off, as you noted.

One additional question: we currently use this heap setting:

-XX:MaxRAMFraction=2

I realize every environment and its tuning goals are different, but just generally, what do you think of MaxRAMFraction=2 with Java 8? If the stateful set is configured with 16Gi memory, that setting would allocate roughly 8Gi to the heap, which seems a safe balance between heap and non-heap. No worries if you don't have enough information to answer (as I haven't shared our tuning goals), but any feedback is, again, appreciated.
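For reference, a quick way to sanity-check what that flag combination resolves to inside the container is to print the final JVM flags there (a rough sketch only; the exact figure depends on the pod's memory limit and the JDK 8 build in use):

java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap \
  -XX:MaxRAMFraction=2 -XX:+PrintFlagsFinal -version | grep MaxHeapSize

Run inside a pod with a 16Gi limit, that should report a MaxHeapSize of roughly 8Gi, i.e. the half-of-container-memory split described above.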
On Mon, Nov 4, 2019 at 10:28 AM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:

> Hi Ben, just catching up over the weekend.
>
> The typical advice, per Sergio's link reference, is an obvious starting point. We use G1GC and normally I'd treat 8gig as the minimal starting point for a heap. What sometimes doesn't get talked about in the myriad of tunings is that you have to have a clear goal in your mind on what you are tuning *for*. You could be tuning for throughput, or average latency, or 99's latency, etc. How you tune varies quite a lot according to your goal. The more your goal is about latency, the more work you have ahead of you.
>
> I will suggest that, if your data footprint is going to stay low, you give yourself permission to do some experimentation. As you're using K8s, you are in a bit of a position where, if your usage is small enough, you can get 2x bang for the buck on your servers by sizing the pods to about 45% of server resources and using the C* rack metaphor to ensure you don't co-locate replicas.
>
> For example, were I you, I'd start asking myself if SSTable compression mattered to me at all. The reason I'd start asking myself questions like that is that C* has multiple uses of memory, and one of the balancing acts is chunk cache versus the O/S file cache. If I could find a way to make my O/S file cache be a de facto C* cache, I'd roll up the shirt sleeves and see what kind of performance numbers I could squeeze out with some creative tuning experiments. Now, I'm not saying *do* that, because your write volume also plays a role, and you said you're expecting a relatively even balance in reads and writes. I'm just saying, by way of example, I'd start weighing whether the advice I get online was based on experience similar to my current circumstance, or on circumstances that were very different.
>
> R
>
> From: Ben Mills <b...@bitbrew.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, November 4, 2019 at 8:51 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC
>
> Hi (yet again) Sergio,
>
> Finally, note that we use this sidecar <https://github.com/Stackdriver/stackdriver-prometheus-sidecar> for shipping metrics to Stackdriver. It runs as a second container within our Prometheus stateful set.
>
> On Mon, Nov 4, 2019 at 8:46 AM Ben Mills <b...@bitbrew.com> wrote:
>
> Hi (again) Sergio,
>
> I forgot to note that along with Prometheus, we use Grafana (with Prometheus as its data source) as well as Stackdriver for monitoring.
>
> As Stackdriver is still developing (i.e. does not have all the features we need), we tend to use it for the basics (i.e. monitoring and alerting on memory, cpu and disk (PVs) thresholds). More specifically, the Prometheus JMX exporter (noted above) scrapes all the MBeans inside Cassandra, exporting in the Prometheus data model. Its ConfigMap filters (i.e. allow-lists) our metrics of interest, and those metrics are sent to our Grafana instances and to Stackdriver. We use Grafana for more advanced metric configs that provide deeper insight into Cassandra - e.g. read/write latencies and so forth. For monitoring memory utilization, we monitor both at the pod level in Stackdriver (i.e. to avoid having a Cassandra pod OOM-killed by the kubelet) and inside the JVM (heap space).
>
> Hope this helps.
>
> On Mon, Nov 4, 2019 at 8:26 AM Ben Mills <b...@bitbrew.com> wrote:
>
> Hi Sergio,
>
> Thanks for this and sorry for the slow reply.
>
> We are indeed still running Java 8 and so it's very helpful.
>
> This Cassandra cluster has been running reliably in Kubernetes for several years, and while we've had some repair-related issues, they are not related to container orchestration or the cloud environment. We don't use operators and have simply built the needed Kubernetes configs (YAML manifests) to handle deployment of new Docker images (when needed), and so forth. We have:
>
> (1) ConfigMap - Cassandra environment variables
> (2) ConfigMap - Prometheus configs for this JMX exporter <https://github.com/prometheus/jmx_exporter>, which is built into the image and runs as a Java agent
> (3) PodDisruptionBudget - with minAvailable: 2 as the important setting
> (4) Service - this is a headless service (clusterIP: None) which specifies the ports for cql, jmx, prometheus, intra-node
> (5) StatefulSet - 3 replicas, ports, health checks, resources, etc. - as you would expect
>
> We store data on persistent volumes using an SSD storage class, and use: an updateStrategy of OnDelete, some affinity rules to ensure an even spread of pods across our zones, Prometheus annotations for scraping the metrics port, a nodeSelector and tolerations to ensure the Cassandra pods run in their dedicated node pool, and a preStop hook that runs nodetool drain to help with graceful shutdown when a pod is rolled.
>
> I'm guessing your installation is much larger than ours, and so operators may be a good way to go. For our needs the above has been very reliable, as has GCP in general.
>
> We are currently updating our backup/restore implementation to provide better granularity with respect to restoring a specific keyspace, and also exploring Velero <https://github.com/vmware-tanzu/velero> for DR.
>
> Hope this helps.
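(As a side note for anyone reproducing the setup above: the preStop hook and PodDisruptionBudget described there could be sketched roughly as follows. This is only an illustration - the names, label, API version and shell invocation are assumptions, not manifests taken from this thread.)

# Container spec fragment in the Cassandra StatefulSet (hypothetical):
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "nodetool drain"]

# PodDisruptionBudget keeping 2 of the 3 replicas available
# (policy/v1beta1 was the current API version circa 2019):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb        # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cassandra         # hypothetical label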
> On Fri, Nov 1, 2019 at 5:34 PM Sergio <lapostadiser...@gmail.com> wrote:
>
> Hi Ben,
>
> Well, I had a similar question, and Jon Haddad was preferring ParNew + CMS over G1GC for Java 8:
> https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
>
> It depends on your JVM, and in any case I would test it based on your workload.
>
> What's your experience of running Cassandra in k8s? Are you using the Cassandra Kubernetes Operator?
>
> How do you monitor it, and how do you perform disaster recovery backups?
>
> Best,
>
> Sergio
>
> On Fri, Nov 1, 2019 at 2:14 PM Ben Mills <b...@bitbrew.com> wrote:
>
> Thanks Sergio - that's good advice and we have this built into the plan.
>
> Have you heard a solid/consistent recommendation or requirement as to the amount of heap memory required for G1GC?
>
> On Fri, Nov 1, 2019 at 5:11 PM Sergio <lapostadiser...@gmail.com> wrote:
>
> In any case, I would test any configuration with tlp-stress or the Cassandra stress tool.
>
> Sergio
>
> On Fri, Nov 1, 2019, 12:31 PM Ben Mills <b...@bitbrew.com> wrote:
>
> Greetings,
>
> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a change to the GC config.
>
> What is the minimum amount of memory that needs to be allocated to heap space when using G1GC?
>
> For GC, we currently use CMS. Along with the version upgrade, we'll be running the stateful set of Cassandra pods on new machine types in a new node pool with 12Gi memory per node. Not a lot of memory, but an improvement. We may be able to go up to 16Gi memory per node. We'd like to continue using these heap settings:
>
> -XX:+UnlockExperimentalVMOptions
> -XX:+UseCGroupMemoryLimitForHeap
> -XX:MaxRAMFraction=2
>
> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of total available).
>
> Here are some details on the environment and configs in the event that something is relevant.
>
> Environment: Kubernetes
> Environment Config: Stateful set of 3 replicas
> Storage: Persistent Volumes
> Storage Class: SSD
> Node OS: Container-Optimized OS
> Container OS: Ubuntu 16.04.3 LTS
> Data Centers: 1
> Racks: 3 (one per zone)
> Nodes: 3
> Tokens: 4
> Replication Factor: 3
> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
> Compaction Strategy: STCS (all tables)
> Read/Write Requirements: Blend of both
> Data Load: <1GB per node
> gc_grace_seconds: default (10 days - all tables)
>
> GC Settings: (CMS)
>
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSWaitDuration=30000
> -XX:+CMSParallelInitialMarkEnabled
> -XX:+CMSEdenChunksRecordAlways
>
> Any ideas are much appreciated.
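For completeness, a G1 starting point in the same spirit as the CMS block above might look roughly like the following, with the existing UnlockExperimentalVMOptions / UseCGroupMemoryLimitForHeap / MaxRAMFraction flags left in place for heap sizing. These values are only a sketch - they follow the commented-out G1 section shipped in Cassandra's jvm.options rather than anything specific to this cluster - and should be validated with tlp-stress or cassandra-stress as suggested earlier in the thread:

GC Settings: (G1 - illustrative)

-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70

Whether the resulting 6-8Gi heap gives G1 enough headroom is exactly what those stress runs would confirm; Reid's 8gig figure earlier in the thread is a reasonable floor to test against.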