One of the explicit goals of making an official sidecar project was to make it something the project does not break compatibility with. One of the main issues the third-party sidecars (which handle distributed control, backup, repair, etc ...) face is that they break constantly, because C* constantly breaks the control interfaces (JMX and config files in particular). If it helps with the mental model, think of the Cassandra sidecar as part of the Cassandra distribution, and we try not to break the distribution. Just as we can't break CQL without breaking the CQL client ecosystem, we hopefully don't break the control interfaces the sidecar depends on either.
On Tue, Mar 28, 2023 at 10:30 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:
>
> - Resources isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
>
> How does having this in a side car change the impact on “storage
> performance”? The side car reading sstables will have the same impact on
> storage IO as the main process reading sstables. Given the sidecar is
> running on the same node as the main C* process, the only real resource
> isolation you have is in heap/GC? CPU/Memory/IO are all still shared between
> the main C* process and the side car, and coordinating those across processes
> is harder than coordinating them within a single process. For example if we
> wanted to have the compaction throughput, streaming throughput, and analytics
> read throughput all tied back to a single disk IO cap, that is harder with an
> external process.

I think we might be underselling how valuable JVM isolation is, especially for analytics queries that are going to pass the entire dataset through heap somewhat constantly. In addition, having this in a separate process gives us easy-to-use OS-level protections over CPU time, memory, network, and disk via cgroups, and lets us take advantage of the isolation techniques kernels already use to protect processes from each other, e.g. CPU schedulers like CFS [1], network qdiscs like tc-fq/tc-prio [2, 3], and IO schedulers like kyber/bfq [4]. Mixing latency-sensitive point queries with throughput-sensitive ones in the same JVM just seems fraught with peril, and I don't buy that we will build the same level of performance isolation that the kernel already has. Note that you do not need containers to do this: the kernel uses these mechanisms by default to enforce fair access to resources, and cgroups just make it better (and can be used regardless of containerization). A rough sketch of what the cgroup approach could look like is at the end of this mail.

This was the thinking behind backup/restore, repair, bulk operations, etc ... living in a separate process. As has been mentioned elsewhere, being able to run that workload on different physical machines is even better for isolation, and I could totally see a wonderful future architecture where the sidecar does incremental backups from the source nodes and restores them every ~10 minutes to the "analytics" nodes that the Spark bulk readers are pointed at.

For isolation, the best option is a separate process on a separate machine, followed by a separate process on the same machine, followed by a separate thread in the same process (historically what C* does). Now that's not to say we need to go straight to the best option, but we probably shouldn't do the worst one?

-Joey

[1] https://man7.org/linux/man-pages/man7/sched.7.html
[2] https://man7.org/linux/man-pages/man8/tc-fq.8.html
[3] https://man7.org/linux/man-pages/man8/tc-prio.8.html
[4] https://docs.kernel.org/block/index.html
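P.S. To make the cgroup point concrete, here is a minimal sketch of capping CPU, memory, and disk bandwidth for a sidecar process with cgroup v2. This is illustrative only: it assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the cpu/memory/io controllers enabled, root privileges, and a made-up group name and PID; it is not something the sidecar ships today.

    # Minimal cgroup v2 sketch: cap CPU, memory, and disk bandwidth for one process.
    # Assumes /sys/fs/cgroup is a cgroup v2 mount, the cpu/memory/io controllers are
    # enabled for this subtree, and we are running as root.
    import os

    CGROUP = "/sys/fs/cgroup/cassandra-sidecar"   # hypothetical group name
    SIDECAR_PID = 12345                           # hypothetical sidecar PID

    os.makedirs(CGROUP, exist_ok=True)

    def set_control(name, value):
        # Each cgroup v2 controller is configured by writing to a file in the group dir.
        with open(os.path.join(CGROUP, name), "w") as f:
            f.write(value)

    set_control("cpu.max", "200000 100000")                     # 200ms of CPU per 100ms period (~2 cores)
    set_control("memory.max", str(8 * 1024 ** 3))               # hard cap at 8 GiB of memory
    set_control("io.max", "8:0 rbps=104857600 wbps=104857600")  # ~100 MiB/s read/write on device 8:0
    set_control("cgroup.procs", str(SIDECAR_PID))               # move the sidecar into the group

The same limits can also be expressed declaratively in a systemd unit (CPUQuota=, MemoryMax=, IOReadBandwidthMax=) rather than writing the files by hand; the point is just that the kernel already gives us these knobs for free once the work lives in its own process.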