@Tommy what do you think? You brought the issue up, I assume because you found it while testing ecaudit against the proposed release and it broke the integration? As an active consumer of the interface, what are your thoughts?
On Aug 1, 2024, at 8:17 AM, Alex Petrov <al...@coffeenco.de> wrote:
> If we have a path that resolves the issue and also maintains full compatibility for this (semi- / reluctantly-accessible) interface, that would seem ideal. Interested to learn more about the drawbacks to that approach.
My thinking here was that, since people who have a binary dependency on this interface have to recompile their code anyway, they may as well change 2 lines by adding a call from the new method with `requestTime.startedAtNanos()`. I am not strongly opposed to merging it, though. If there is general agreement that this is the best way, let's do this; I do not see any drawbacks in terms of performance or otherwise.
If we decide to move forward with it, the patch is up [1].
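For readers following along, here is a minimal sketch of the delegation idea discussed above: the old long-based signature stays on the interface, and the new signature forwards to it with `requestTime.startedAtNanos()`. The types `RequestTime`, `Handler` and `LegacyHandler` are hypothetical, simplified stand-ins, not the real Cassandra classes, and this is not necessarily how the patch in [1] is written.

```java
// Hypothetical, simplified stand-ins; NOT the actual Cassandra types or signatures.
final class RequestTime {
    private final long enqueuedAtNanos;
    private final long startedAtNanos;

    RequestTime(long enqueuedAtNanos, long startedAtNanos) {
        this.enqueuedAtNanos = enqueuedAtNanos;
        this.startedAtNanos = startedAtNanos;
    }

    long enqueuedAtNanos() { return enqueuedAtNanos; }
    long startedAtNanos() { return startedAtNanos; }
}

interface Handler {
    // Old signature, kept so that existing implementations still compile and link.
    String process(String query, long queryStartNanoTime);

    // New signature: the default implementation forwards to the old method.
    default String process(String query, RequestTime requestTime) {
        return process(query, requestTime.startedAtNanos());
    }
}

// An external implementation written against the old signature only.
final class LegacyHandler implements Handler {
    @Override
    public String process(String query, long queryStartNanoTime) {
        return query + "@" + queryStartNanoTime;
    }
}

final class CompatSketch {
    public static void main(String[] args) {
        Handler handler = new LegacyHandler();
        // The server side calls the new signature; the legacy code still runs.
        System.out.println(handler.process("SELECT 1", new RequestTime(100L, 250L)));
        // prints "SELECT 1@250"
    }
}
```

Without such a default method, an implementor would write the same one-line forwarding call in their own class, which is the two-line change mentioned above.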
On Wed, Jul 31, 2024, at 11:24 PM, C. Scott Andreas wrote:
Sorry to veer off from a vote in a vote thread.
@Alex, can you say more about this statement:
> "I think I would prefer to not introduce the change I have proposed (the one that would bring back non-binary compatibility)."
If we have a path that resolves the issue and also maintains full compatibility for this (semi- / reluctantly-accessible) interface, that would seem ideal. Interested to learn more about the drawbacks to that approach.
Regarding the value of C-19534, I'm happy to attest that it addresses severe metastable failure modes in clusters under heavy traffic on the verge of tipping. Jon Haddad's independent testing validated this as well, as discussed on the ticket: https://issues.apache.org/jira/browse/CASSANDRA-19534
Last, @Tommy this is a great catch and I'm glad you raised it. Thanks for watching so closely and appreciate you bringing it to everyone's attention.
– Scott
On Jul 31, 2024, at 1:05 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
+1 to proceeding with a simple upgrade note in NEWS
Unfortunately, I cannot immediately see a good way to provide the critical bugfix of CASSANDRA-19534, affecting all Cassandra users, without making at least some change in this API.
I personally think that this method is too tightly coupled to the implementation to expose it via -D. If anyone using it could explain why it is an important part of the API, that would give some useful context.
Nobody stepping up to engage on the technical piece of this? Unless / until somebody does, Alex' argument holds the most weight as the expert with what's going on IMO.
The question we're facing is - when we find a defect that requires a change in a public facing API, which of the following 2 is more important:
- Keeping the API stable
- Having the defect resolved
Obviously this will be case-by-case. What CASSANDRA-19534 addresses:
When a node is under pressure, hundreds of thousands of requests can show up in the native transport queue, and it looks like it can take way longer to timeout than is configured.
...
After stopping the load test altogether, it took nearly a minute before the requests were no longer queued.
I believe our priority here should be having this defect resolved.
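To make the quoted symptom concrete, here is a minimal sketch of a queue that measures the timeout from the moment a request was enqueued, so that requests that have already waited too long are dropped when dequeued instead of being processed. This only illustrates the general idea; it is not the CASSANDRA-19534 patch, and `QueuedRequest`, `drain` and the numbers used are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical request record: a label plus the time it entered the queue.
final class QueuedRequest {
    final String name;
    final long enqueuedAtNanos;

    QueuedRequest(String name, long enqueuedAtNanos) {
        this.name = name;
        this.enqueuedAtNanos = enqueuedAtNanos;
    }
}

final class QueueTimeoutSketch {
    // Empties the queue and returns the names of requests that are still within
    // their timeout at nowNanos; requests that waited too long are discarded.
    static List<String> drain(Queue<QueuedRequest> queue, long nowNanos, long timeoutNanos) {
        List<String> live = new ArrayList<>();
        QueuedRequest request;
        while ((request = queue.poll()) != null) {
            // Comparing the difference keeps the check valid for System.nanoTime() values.
            if (nowNanos - request.enqueuedAtNanos < timeoutNanos) {
                live.add(request.name);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Queue<QueuedRequest> queue = new ArrayDeque<>();
        queue.add(new QueuedRequest("old", 0L));     // has waited 1000 ns
        queue.add(new QueuedRequest("fresh", 900L)); // has waited 100 ns
        System.out.println(drain(queue, 1_000L, 500L)); // prints [fresh]
    }
}
```

If the clock instead started only when processing began, the "old" request above would still be executed even though its client had long since given up, which is the kind of backlog described in the ticket.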
On Tue, Jul 30, 2024, at 1:43 PM, Jordan West wrote:
I would make the case that loss of availability / significant performance issue, regardless of the amount of time it has existed for, is worth fixing on the branches that are widely deployed by the community. Especially when weighed against a loosely defined public interface issue.
The queuing issue has been a persistent problem (like you said, 10 years) and I regularly (approx once every 1-2 weeks) have to tell my customers “we either have to wait for Cassandra to clear the queues or do a rolling restart to fix it”, both of which come at a cost during an incident where a client overloaded the DB and the impact is severe or business impacting. This is especially true for customers doing LWTs or using non-standard RFs, which are also more prevalent in my experience than an external implementation of QueryHandler.
While not data loss, I would argue this is a critical bug and if we did find a data loss issue dormant for 10 years (which has happened in the past) we would fix it as soon as it was found and a patch was made available on all actively maintained versions.
Jordan
It’s a 10 year old flaw in an 18 month old branch. Why does it need to go into 4.1? It’s not a regression, and it clearly breaks compatibility.
This patch fixes a long standing issue that's the root cause of availability failures. Even though folks can specify a custom query handler with the -D flag, the number of users impacted by this is going to be incredibly small. On the other hand, the fix helps every single user of 4.1 that puts too much pressure on the cluster, which happens fairly regularly.
My POV is that it's a fairly weak argument that this is a public interface, but I don't consider it worth debating whether it is or not, because even if it is, this improves stability of the database for all users, so it's worth going in. Let's not be dogmatic about fixes that help 99% of users because an incredibly small number that actually implement a custom query handler will need to make a trivial update in order to use the latest 4.1.6 dependency.
Jon
Given that we allow a pluggable query handler implementation to be specified for the server with a -D during startup, I would consider the query handler one of our public interfaces.
Hi Tommy,
Thank you for spotting this and bringing this to the community's attention.
I believe our primary interfaces are the native and internode protocols, and CLI tools. Most interfaces are used to abstract implementations internally. A few interfaces, such as DataType, Partitioner, and Triggers, can be depended upon by external tools using Cassandra as a library. There is no official way to plug in a QueryHandler, so I did not consider it to be a part of our public API.
From [1]:
> These considerations are especially important for public APIs, including CQL, virtual tables, JMX, yaml, system properties, etc. Any planned additions must be carefully considered in the context of any existing APIs. Where possible the approach of any existing API should be followed.
Maybe we should have an exhaustive list of public APIs, and explicitly mention that the native and internode protocols are included, along with the nodetool command API and output, but also which classes/interfaces specifically should be evolved with care.
Thank you,
--Alex
On Tue, Jul 30, 2024, at 10:56 AM, Tommy Stendahl via dev wrote:
Hi,
Do we allow such changes between 4.1.5 and 4.1.6?
CASSANDRA-19534 looks like a very good change so maybe there is an exception in this case?
/Tommy
-----Original Message-----
Subject: [VOTE] Release Apache Cassandra 4.1.6
Date: Mon, 29 Jul 2024 09:36:04 -0500
Proposing the test build of Cassandra 4.1.6 for release.
sha1: b662744af59f3a3dfbfeb7314e29fecb93abfd80
Git:
https://eur02.safelinks.protection.outlook.com/?url="">
Maven Artifacts:
https://eur02.safelinks.protection.outlook.com/?url="">
The Source and Build Artifacts, and the Debian and RPM packages and
repositories, are available here:
https://eur02.safelinks.protection.outlook.com/?url="">
The vote will be open for 72 hours (longer if needed). Everyone who
has tested the build is invited to vote. Votes by PMC members are
considered binding. A vote passes if there are at least three binding
+1s and no -1's.
[1]: CHANGES.txt:
https://eur02.safelinks.protection.outlook.com/?url="">
[2]: NEWS.txt:
https://eur02.safelinks.protection.outlook.com/?url="">
Kind Regards,
Brandon