Re: [DISCUSS] Chronicle Queue's development model and a hypothetical replacement of the library

Štefan Miklošovič Tue, 17 Sep 2024 01:56:57 -0700

There are configuration properties that control what the bin log does at
runtime, so if we completely changed the mechanism it operates on, the
only things that would stay the same are the name of each command and the
logical operation it performs (enable / disable, get the config if there
is any) ...


If we ever build another solution, I think it would be better to keep the
old one in place, develop the new one in parallel, and ditch the old
solution only once the new one is stable enough.

BTW I have one technical question here, directed not at Benedict, to whom
I am replying, but at the broader audience out there:

If this statement in the javadocs (which I linked earlier in this thread)
is true:

"Performance safety is accomplished by feeding items to the binary log
using a weighted queue and dropping records if the binary log falls
sufficiently far behind."

then how is it possible that FQL works? If some events can be dropped,
then the actual queries that were executed are missing from the log. When
we replay the logs (the FQL framework can replay the logs against an empty
database), there is no guarantee that we end up with the same state of the
database. So is FQL, in this sense, a "best effort" kind of tooling?
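
To make the question concrete, here is a minimal sketch of how I read that
javadoc sentence. It is not the actual BinLog code: the class
WeightedDropQueue and its methods are hypothetical names, and I am assuming
that "weight" means something like the serialized size of a record and that
a record is dropped when it does not fit into the remaining capacity.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration only, not Cassandra's implementation.
// Capacity is measured in total weight, not in number of records.
final class WeightedDropQueue<T> {
    private final Queue<T> items = new ArrayDeque<>();
    private final Queue<Long> weights = new ArrayDeque<>();
    private final long maxWeight;
    private long currentWeight = 0;
    private long dropped = 0;

    WeightedDropQueue(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    // Producer side: never blocks. Returns false if the record was dropped
    // because the consumer (the log writer) has fallen too far behind.
    synchronized boolean offer(T item, long weight) {
        if (currentWeight + weight > maxWeight) {
            dropped++;
            return false;
        }
        items.add(item);
        weights.add(weight);
        currentWeight += weight;
        return true;
    }

    // Consumer side: takes the oldest record and frees its weight,
    // or returns null if the queue is empty.
    synchronized T poll() {
        T item = items.poll();
        if (item != null) {
            currentWeight -= weights.poll();
        }
        return item;
    }

    synchronized long droppedCount() {
        return dropped;
    }
}
```

With a capacity of 10, offering records of weight 6, 6 and 4 in that order
accepts the first and the third and drops the second. A replay of whatever
the consumer drained would then apply only two of the three queries, which
is exactly the gap I am asking about.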

On Tue, Sep 17, 2024 at 10:37 AM Benedict <[email protected]> wrote:

> My point is only that AFAICT we use it for something incredibly basic that
> we do all the time elsewhere without it.
>
> I’m not proposing we remove it, I don’t have a position on that. But if we
> don’t _trust_ ourselves to replace it we should get out of the database
> game.
>
> The fact it would break compatibility between releases is suboptimal, but
> IMO not at all a dealbreaker because these files are not required to be
> compatible between versions - they’re offline logs and I think it would be
> fine to require different viewers for files produced by different versions
> of Cassandra.
>
> I do not think any of the nodetool methods would be affected by this, as
> they do not appear to touch the contents of the log files.
>
> On 17 Sep 2024, at 09:28, Štefan Miklošovič <[email protected]>
> wrote:
>
> 
> to Benedict:
>
> well ... I was not around when the decision about the usage of Chronicle
> Queues was made. I think that at that time it was the most obvious
> candidate without reinventing the wheel given the features and capabilities
> it had so taking something off the shelf was a natural conclusion.
>
> Josh / Jordan:
>
> not only FQL but Audit as well; these are two separate things. There is
> also quite a "rich" ecosystem around that.
>
> 1) nodetool commands like
>
> enableauditlog
> enablefullquerylog
> disableauditlog
> disablefullquerylog
> getauditlog
> getfullquerylog
>
> Also, because the files it produces are binary, we need special tooling
> to inspect them. It lives in tools/fqltool with a bunch of classes, and
> there is also an AuditLogViewer for reviewing audit logs.
>
> There are MBean methods enabling nodetool commands.
>
> We have also shipped that in two major releases (4.0 and now in 5.0) so
> the community is quite well used to this, they have the processes set
> around this etc.
>
> I mention all this because it is just not so easy to replace it with
> something else if somebody wanted that, in any case. How do we even go
> about deprecating this if we are indeed going to replace it?
>
> To discuss the release aspect they have in place: I think you are right
> that the latest ea is as close as possible, if not the same, as what they
> release privately. Yes. But if we want to stick to the rule that we upgrade
> only to the latest ea release before their next minor, then
>
> 1) we will be always at least one minor late
> 2) we do not know when they make up their minds to transition to a new
> minor so we can upgrade to the latest ea one minor before
> 3) if something is broken and we need to fix it and we are on ea, then
> what we get to update to is the latest ea at that time which might fix the
> issue but it will also bring new stuff in which might open doors to
> instability as well. So we update to fix the bugs but we might include new
> ones unknowingly.
>
> Anyway, I don't think this has any silver bullet solution, we might just
> stick to the latest "ea" and be done with it. I do not expect this project
> to evolve wildly and unpredictably, it just solves "one problem", there is
> basically nothing new coming in.
>
> Brandon:
>
> I understand your concerns about phoning home but
>
> 1) we already resolved this by setting the respective property
> 2) I do not think that Chronicle will mess with this once they introduce
> that. There is nothing to "improve" or "change" there. It is phoning home
> or not and it is driven by one property. If they made a change that we can
> not turn it off then we would really be in trouble but for now we are not
> and practically speaking I don't expect this would change.
>
> I know that this might sound like wishful thinking but in practical terms
> I really just don't expect this phoning home thing would come back ever.
>
> Speaking of alternatives, I think the primary reason Chronicle was used is
> this (1).
>
> "It's goal is good enough performance, predictable footprint, simplicity
> in terms of implementation and configuration and most importantly minimal
> impact on producers of log records."
>
> While I understand English (I guess, well enough :D), I just don't
> understand what "good enough performance" is. How is this measured? What is
> a "predictable footprint"? Was that measured too? How did we quantify that?
>
> " Performance safety is accomplished by feeding items to the binary log
> using a weighted queue and dropping records if the binary log falls
> sufficiently far behind."
>
> This is interesting. If I understand correctly, the messages are weighted
> and the heavier they are, the more probable it is they will be dropped when
> it is overloaded? Or vice versa, the lighter ones are dropped first?
>
> Have we _ever_ experienced in production that some log events were really
> dropped? Has anybody ever hit that?
>
> When it comes to alternatives, what about logback + slf4j? It has
> appenders where we want, it is sync / async, we can code some nio appender
> too I guess, it logs it as text into a file so we do not need any special
> tooling to review that. For tailing which Chronicle also offers, I guess
> "tail -f that.log" just does the job? logback even rolls the files after a
> configured period / size, the same way Chronicle does (it even compresses
> the rolled logs).
>
> Do we log so much that battle-tested logback is just absolutely not
> enough for us? Come on, this is not rocket science that we need to use a
> library from the realm of "high frequency trading" to just append queries
> and audit logs as they are executed. logback can handle the load we have
> just fine imo ...
>
> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>
> (1)
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
>
>
> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <[email protected]> wrote:
>
>> My concern is that we have to keep making sure it's not phoning home(1,2).
>>
>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>
>> Kind Regards,
>> Brandon
>>
>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <[email protected]>
>> wrote:
>> >
>> > I think it's FQLTool only right now; I bumped into it recently doing
>> the JDK21 compat work.
>> >
>> > I'm not concerned about current usage / dependency, but if our usage
>> expands this could start to become a problem and that's going to be a hard
>> thing to track and manage.
>> >
>> > So reading through those issues Stefan, I think it boils down to:
>> >
>> > The latest ea is code identical to the stable release
>> > Subsequent bugfixes get applied to the customer-only stable branch and
>> one release forward
>> > Projects running ea releases would need to cherry-pick those bugfixes
>> back or run on the next branch's ea, which could introduce the project to
>> API changes or other risks
>> >
>> > Assuming that's the case... blech. Our exposure is low, but that seems
>> like a real pain.
>> >
>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>> >
>> >
>> > Don’t we essentially just use it as a file format for storing a couple
>> of kinds of append-only data?
>> >
>> > I was never entirely clear on the value it brought to the project.
>> >
>> >
>> > On 16 Sep 2024, at 22:11, Jordan West <[email protected]> wrote:
>> >
>> > 
>> > Thanks for the sleuthing Stefan! This definitely is a bit unfortunate.
>> It sounds like a replacement is not really practical so I'll ignore that
>> option for now, until a viable alternative is proposed. I am -1 on us
>> writing our own without strong, strong justification -- primarily because I
>> think the likelihood is we introduce more bugs before getting to something
>> stable.
>> >
>> > Regarding the remaining options, mostly some thoughts:
>> >
>> > - it would be nice to have some specific evidence of other projects
>> using the EA versions and what their developers have said about it.
>> > - it sounds like if we go with the EA route, the onus to test for
>> correctness / compatibility increases. They do test but anything marked
>> "early access" I think deserves more scrutiny from the C* community before
>> release. That could come in the form of more tests (or showing that we
>> already have good coverage of where its used).
>> > - I assume each time we upgrade we would pick the most recently
>> released EA version
>> >
>> > Jordan
>> >
>> >
>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič <
>> [email protected]> wrote:
>> >
>> > We are using a library called Chronicle Queue (1) and its dependencies
>> and we ship them in the distribution tarball.
>> >
>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you
>> look closely here (2), there is one more release like this, 2.23.37 and
>> after that all these releases have "ea" in their name.
>> >
>> > "ea" stands for "early access". The project has changed the versioning
>> / development model in such a way that "ea" releases act, more or less, as
>> glorified snapshots which are indeed released to Maven Central but the
>> "regular" releases are not there. The reason behind this is that "regular"
>> releases are published only for customers who pay the company behind
>> this project, which offers commercial support for it.
>> >
>> > "regular" releases are meant to get all the bug fixes after "ea" is
>> published and they are official stable releases. On the other hand "ea"
>> releases are the ones where the development happens and every now and then,
>> once the developers think that it is time to cut new 2.x, they just publish
>> that privately.
>> >
>> > I was investigating how this all works here (3) and while they said
>> that, I quote (4):
>> >
>> > "In my experience this is consumed by a large number of open source
>> projects reliably (for our other artifacts too). This development/ea branch
>> still goes through an extensive test suite prior to release. Releases from
>> this branch will contain the latest features and bug fixes."
>> >
>> > I am not completely sure if we are OK with this. For the record, Mick
>> is not overly comfortable with that and Brandon would prefer to just
>> replace it / get rid of this dependency (comments / reasons / discussion
>> from (5) to the end)
>> >
>> > The question is if we are OK with how things are and if we are then
>> what are the rules when upgrading the version of this project in Cassandra
>> in the context of "ea" versions they publish.
>> >
>> > If we are not OK with this, then the question is what we are going to
>> replace it with.
>> >
>> > If we are going to replace it, I very briefly took a look and there is
>> practically nothing out there which would hit all the buttons for us.
>> Chronicle is just perfect for this job and I am not a fan of rewriting this
>> at all.
>> >
>> > I would like to have this resolved because there is CEP-12 I plan to
>> deliver and I hit this and I do not want to base that work on something we
>> might eventually abandon. There are some ideas for CEP-12 how to bypass
>> this without using Chronicle but I would like to hear your opinion first.
>> >
>> > Regards
>> >
>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>> > (4)
>> https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>> > (5)
>> https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254
>> >
>> >
>>
>
