Strong +1 to the file format issue, and if we're building a wish list - it would be great if we could read the file format without pulling in cassandra-all. Long term, I'd love to see this for SSTables & Commit logs as well.
I've long been a fan of Gradle subprojects because it makes this kind of thing fairly easy.

Jon

On Sun, Sep 29, 2024 at 11:46 AM Andrew Weaver <andrewjwea...@gmail.com> wrote:

> I'm late to the discussion here, but I want to add my experience from dealing with audit logs specifically.
>
> Chronicle has some advantages (binary, compact) but it has a serious disadvantage from a consumption standpoint: it's not a well-supported file format. Audit logs are something that I think most operators are interested in archiving for compliance purposes and analyzing offline for any number of reasons, and an oddball file format is an unnecessary hurdle for the audit-log use case.
>
> I would welcome support for an existing format that is compact, high-performance and compatible with common tools (Spark, etc.).
>
> On Sun, Sep 29, 2024, 10:11 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> Thank you all for your answers and opinions. I would like to have some kind of resolution here in order to move forward, especially in relation to CEP-12, which I mentioned earlier (1).
>>
>> I think we have these options:
>>
>> 1) Do nothing and wait until this gets back to us, probably in a more serious way (we find a bug and we will not be able to update because we would be on "ea", or new features will be available only in newer versions).
>>
>> 2) Fork it and continue to maintain it - I do not think this is realistic; nobody is going to take care of forking that and maintaining it long term.
>>
>> 3) Do nothing but refactor it in such a way that it will be easier to replace with something else in the future. CEP-12 is not only adding persistence for diagnostic events; the patch I have also makes the whole logging more robust. Even if it is all on Chronicle Queues (FQL, Audit ...), there are some differences in the implementations, and I think that if we refactor it so that it has a clear class structure and hierarchy (bottom of CEP-12), we will have an easier job if we ever go to replace it.
>>
>> 4) Proceed with CEP-12 even though we know we are building it on top of something which should not be there.
>>
>> 5) Do absolutely nothing until we replace it with something else and get rid of what is there right now - that would mean we will not benefit from code which is easier to maintain etc. (if CEP-12 is not going to materialize), which I think is a welcome attribute for the code base to have.
>>
>> I was thinking more about stuff like protobuf, and while I do see the benefits of that, honestly, it just does not matter too much whether it is done like that or not. I mean, sure, it would be cool to have, but we could spend a lot of effort on protobuf and integrating with it, or on anything which would make the consumption of these events language-agnostic, but these are quite niche scenarios and I think that time might be used more effectively elsewhere.
>>
>> The bottom line is that I am reluctant to do anything unless CEP-12 makes it in one way or another (either with diagnostic persistence or without it but with a nice refactoring) and, let's get real here, I do not think that anybody is going to spend any time on this particular piece of functionality either. So the net result is that it will either keep atrophying or we at least clean it up so whoever comes next has an easier job replacing it.
>>
>> (1) https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-12+Diagnostics+events+persistence+and+their+exposure+in+virtual+tables
>>
>> On Tue, Sep 24, 2024 at 6:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>>> Hi,
>>>
>>> I just don't understand what "good enough performance" is.
>>>
>>> Should really specify throughput. There is a single thread writing records to the log and it's a bottleneck at around a few hundred thousand entries/sec and 1 GB/sec. It doesn't scale to arbitrary throughput requirements.
>>>
>>> What is a "predictable footprint"? Was that measured too? How did we quantify that?
>>>
>>> You can set a rolling cycle to limit the size of the log. It's not that predictable disk-space-wise because rolling is time based, and that is one of the things I don't like about Chronicle.
>>>
>>> This is interesting, if I understand correctly, the messages are weighted and the heavier they are, the more probable it is they will be dropped when it is overloaded? Or vice versa, the lighter ones are dropped first?
>>>
>>> It's still a FIFO queue. Elements aren't dropped from the queue; they are dropped by the producers, who don't have to wait for the consumer of the queue to catch up. The queue size is described in terms of weight, not number of elements, so it can bound memory usage.
>>>
>>> Have we _ever_ experienced in production that some log events were really dropped? Has anybody ever hit that?
>>>
>>> Dropping samples is off by default so it can be used in a lossless way.
>>>
>>> Notionally, one of the use cases of full query logging is that you have a cluster that is overloaded and you want to find out what is causing it. These nodes may be low on IO/CPU, and turning on the full query log could cause additional timeouts, so one goal of the full query log is that enabling it shouldn't make things worse.
>>>
>>> That is the motivation for the memory limits and for not blocking request threads on IO. Really there should also be rate limits and random sampling, because right now dropping samples will be biased towards dropping large-footprint samples.
>>>
>>> David Capwell mentioned some performance issues. I recall we talked about it and I did a quick microbenchmark and didn't have a problem writing records (1 gigabyte/sec, hundreds of thousands of entries), so I am not sure what the scenario is where performance is bad and whether it is addressable. Not sure it matters since Chronicle's approach to OSS is so problematic.
>>>
>>> Ariel
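As an illustrative aside, the scheme Ariel describes (a FIFO queue bounded by total weight, where producers drop a record instead of waiting when the budget is exhausted) can be sketched roughly as below. The class and method names are hypothetical and this is not Cassandra's actual BinLog/WeightedQueue code; it only shows the idea.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Semaphore;
    import java.util.function.ToIntFunction;

    final class WeightBoundedLog<T>
    {
        private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
        private final Semaphore weightBudget;     // remaining weight (e.g. bytes) the queue may still hold
        private final ToIntFunction<T> weigher;   // how "heavy" a record is, e.g. its serialized size

        WeightBoundedLog(int maxWeight, ToIntFunction<T> weigher)
        {
            this.weightBudget = new Semaphore(maxWeight);
            this.weigher = weigher;
        }

        /** Called on request threads; returns false (record dropped) rather than ever blocking. */
        boolean offer(T record)
        {
            int weight = weigher.applyAsInt(record);
            if (!weightBudget.tryAcquire(weight))
                return false;                      // consumer has fallen too far behind: drop, don't wait
            queue.add(record);
            return true;
        }

        /** Called by the single consumer thread that appends records to the on-disk log. */
        T poll()
        {
            T record = queue.poll();
            if (record != null)
                weightBudget.release(weigher.applyAsInt(record));
            return record;
        }
    }

The bias Ariel mentions falls out naturally: a heavy record needs more permits, so under pressure it is more likely to fail tryAcquire and be dropped than a light one.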
>>> On Tue, Sep 17, 2024, at 4:27 AM, Štefan Miklošovič wrote:
>>>
>>> to Benedict:
>>>
>>> well ... I was not around when the decision about the usage of Chronicle Queues was made. I think that at that time it was the most obvious candidate without reinventing the wheel, given the features and capabilities it had, so taking something off the shelf was a natural conclusion.
>>>
>>> Josh / Jordan:
>>>
>>> not only FQL but Audit as well - these are two separate things. There is also quite a "rich" ecosystem around that.
>>>
>>> 1) nodetool commands like
>>>
>>> enableauditlog
>>> enablefullquerylog
>>> disableauditlog
>>> disablefullquerylog
>>> getauditlog
>>> getfullquerylog
>>>
>>> Also, because the files it produces are binary, we need special tooling to inspect them; it lives in tools/fqltool with a bunch of classes, and there is also an AuditLogViewer for reviewing audit logs.
>>>
>>> There are MBean methods enabling the nodetool commands.
>>>
>>> We have also shipped that in two major releases (4.0 and now 5.0), so the community is quite used to this; they have their processes set up around it etc.
>>>
>>> I mention all this because it is just not so easy to replace it with something else if somebody wanted that, in any case. How would we even go about deprecating this if we are indeed going to replace it?
>>>
>>> To discuss the release model they have in place: I think you are right that the latest ea is as close as possible, if not identical, to what they release privately. Yes. But if we want to stick to the rule that we upgrade only to the latest ea release before their next minor, then
>>>
>>> 1) we will always be at least one minor late
>>> 2) we do not know when they will decide to transition to a new minor, so we cannot predict when we can upgrade to the last ea of the previous minor
>>> 3) if something is broken and we need to fix it and we are on ea, then what we get to update to is the latest ea at that time, which might fix the issue but will also bring in new stuff, which might open the door to instability as well. So we update to fix bugs but we might pull in new ones unknowingly.
>>>
>>> Anyway, I don't think there is any silver bullet here; we might just stick to the latest "ea" and be done with it. I do not expect this project to evolve wildly and unpredictably, it just solves "one problem", there is basically nothing new coming in.
>>>
>>> Brandon:
>>>
>>> I understand your concerns about phoning home, but
>>>
>>> 1) we already resolved this by setting the respective property
>>> 2) I do not think that Chronicle will mess with this once they introduce it. There is nothing to "improve" or "change" there. It either phones home or it does not, and it is driven by one property. If they made a change so that we could not turn it off then we would really be in trouble, but for now we are not, and practically speaking I don't expect this to change.
>>>
>>> I know that this might sound like wishful thinking, but in practical terms I really just don't expect this phoning-home thing to ever come back.
>>>
>>> Speaking of alternatives, I think the primary reason Chronicle was used is this (1):
>>>
>>> "It's goal is good enough performance, predictable footprint, simplicity in terms of implementation and configuration and most importantly minimal impact on producers of log records."
>>>
>>> While I understand English (I guess, well enough :D), I just don't understand what "good enough performance" is. How is this measured? What is a "predictable footprint"? Was that measured too? How did we quantify that?
>>>
>>> "Performance safety is accomplished by feeding items to the binary log using a weighted queue and dropping records if the binary log falls sufficiently far behind."
>>>
>>> This is interesting, if I understand correctly, the messages are weighted and the heavier they are, the more probable it is they will be dropped when it is overloaded? Or vice versa, the lighter ones are dropped first?
>>>
>>> Have we _ever_ experienced in production that some log events were really dropped? Has anybody ever hit that?
>>>
>>> When it comes to alternatives, what about logback + slf4j? It has the appenders we want, it is sync / async, we could code some nio appender too I guess, and it logs as text into a file so we do not need any special tooling to review it. For tailing, which Chronicle also offers, I guess "tail -f that.log" just does the job? logback even rolls the files after they get big enough, so it rolls the files the same way after some configured period / size as Chronicle does (it even compresses the logs).
>>>
>>> Do we log so much that battle-tested logback is just absolutely not enough for us? Come on, this is not rocket science; do we really need a library from the realm of "high frequency trading" just to append queries and audit logs as they are executed? logback can handle the load we have just fine imo ...
>>>
>>> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>>>
>>> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
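To make the logback + slf4j idea above concrete, a rough programmatic sketch of a dedicated audit logger follows. It is only an illustration under assumed names (the "cassandra.audit" logger name, file paths, sizes and retention are invented, and this is not a proposal for the actual wiring); it simply shows that size/time-based rolling, compression of rolled files, and a non-blocking async hand-off are stock logback features.

    import ch.qos.logback.classic.AsyncAppender;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.rolling.RollingFileAppender;
    import ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy;
    import ch.qos.logback.core.util.FileSize;
    import org.slf4j.LoggerFactory;

    public final class AuditLogbackSketch
    {
        public static void main(String[] args)
        {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            PatternLayoutEncoder encoder = new PatternLayoutEncoder();
            encoder.setContext(ctx);
            encoder.setPattern("%d{ISO8601}|%msg%n");
            encoder.start();

            RollingFileAppender<ILoggingEvent> file = new RollingFileAppender<>();
            file.setContext(ctx);
            file.setFile("audit/audit.log");
            file.setEncoder(encoder);

            // roll by time and size, gzip rolled files, cap retained history
            SizeAndTimeBasedRollingPolicy<ILoggingEvent> policy = new SizeAndTimeBasedRollingPolicy<>();
            policy.setContext(ctx);
            policy.setParent(file);
            policy.setFileNamePattern("audit/audit-%d{yyyy-MM-dd}.%i.log.gz");
            policy.setMaxFileSize(FileSize.valueOf("256MB"));
            policy.setMaxHistory(7);
            policy.start();

            file.setRollingPolicy(policy);
            file.start();

            // async wrapper so request threads hand off and never block on disk IO
            AsyncAppender async = new AsyncAppender();
            async.setContext(ctx);
            async.setQueueSize(1024);
            async.setNeverBlock(true);   // drop rather than stall when the queue is full
            async.addAppender(file);
            async.start();

            Logger audit = ctx.getLogger("cassandra.audit");   // hypothetical logger name
            audit.setAdditive(false);
            audit.addAppender(async);

            // one record per line, readable with tail/grep, no special tooling needed
            audit.info("user=alice|ks=test|stmt=SELECT ...");
        }
    }

setNeverBlock(true) gives a crude analogue of Chronicle's "drop instead of stalling request threads" behaviour, although the async queue is bounded by element count rather than by weight.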
>>> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <dri...@gmail.com> wrote:
>>>
>>> My concern is that we have to keep making sure it's not phoning home (1, 2).
>>>
>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>> >
>>> > I think it's FQLTool only right now; I bumped into it recently doing the JDK21 compat work.
>>> >
>>> > I'm not concerned about current usage / dependency, but if our usage expands this could start to become a problem, and that's going to be a hard thing to track and manage.
>>> >
>>> > So reading through those issues Stefan, I think it boils down to:
>>> >
>>> > - The latest ea is code-identical to the stable release
>>> > - Subsequent bugfixes get applied to the customer-only stable branch and one release forward
>>> > - Projects running ea releases would need to cherry-pick those bugfixes back or run on the next branch's ea, which could expose the project to API changes or other risks
>>> >
>>> > Assuming that's the case... blech. Our exposure is low, but that seems like a real pain.
>>> >
>>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>>> >
>>> > Don’t we essentially just use it as a file format for storing a couple of kinds of append-only data?
>>> >
>>> > I was never entirely clear on the value it brought to the project.
>>> >
>>> > On 16 Sep 2024, at 22:11, Jordan West <jw...@apache.org> wrote:
>>> >
>>> > Thanks for the sleuthing Stefan! This definitely is a bit unfortunate. It sounds like a replacement is not really practical, so I'll ignore that option for now, until a viable alternative is proposed. I am -1 on us writing our own without strong, strong justification -- primarily because I think the likelihood is we introduce more bugs before getting to something stable.
>>> >
>>> > Regarding the remaining options, some thoughts:
>>> >
>>> > - it would be nice to have some specific evidence of other projects using the EA versions and what their developers have said about it.
>>> > - it sounds like if we go with the EA route, the onus to test for correctness / compatibility increases. They do test, but anything marked "early access" I think deserves more scrutiny from the C* community before release. That could come in the form of more tests (or showing that we already have good coverage of where it's used).
>>> > - I assume each time we upgrade we would pick the most recently released EA version
>>> >
>>> > Jordan
>>> >
>>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>> >
>>> > We are using a library called Chronicle Queue (1) and its dependencies, and we ship them in the distribution tarball.
>>> >
>>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you look closely here (2), there is one more release like this, 2.23.37, and after that all the releases have "ea" in their name.
>>> >
>>> > "ea" stands for "early access". The project has changed its versioning / development model in such a way that "ea" releases act, more or less, as glorified snapshots which are indeed released to Maven Central, while the "regular" releases are not there. The reason behind this is that "regular" releases are published only for customers who pay the company behind this project, which offers commercial support for them.
>>> >
>>> > "regular" releases are meant to get all the bug fixes after an "ea" is published and they are the official stable releases. On the other hand, "ea" releases are the ones where the development happens, and every now and then, once the developers think that it is time to cut a new 2.x, they just publish that privately.
>>> >
>>> > I was investigating how this all works here (3), and they said, I quote (4):
>>> >
>>> > "In my experience this is consumed by a large number of open source projects reliably (for our other artifacts too). This development/ea branch still goes through an extensive test suite prior to release. Releases from this branch will contain the latest features and bug fixes."
>>> >
>>> > I am not completely sure if we are OK with this. For the record, Mick is not overly comfortable with it, and Brandon would prefer to just replace it / get rid of this dependency (comments / reasons / discussion from (5) to the end).
>>> >
>>> > The question is whether we are OK with how things are, and if we are, what the rules are for upgrading the version of this project in Cassandra in the context of the "ea" versions they publish.
>>> >
>>> > If we are not OK with this, then the question is what we are going to replace it with.
>>> >
>>> > If we are going to replace it, I very briefly took a look and there is practically nothing out there which would hit all the buttons for us. Chronicle is just perfect for this job and I am not a fan of rewriting this at all.
>>> >
>>> > I would like to have this resolved because there is CEP-12 which I plan to deliver, and I hit this, and I do not want to base that work on something we might eventually abandon. There are some ideas in CEP-12 for how to bypass this without using Chronicle, but I would like to hear your opinion first.
>>> >
>>> > Regards
>>> >
>>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>>> > (4) https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>>> > (5) https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254
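For reference, the append-only usage Benedict asks about earlier in the thread boils down to roughly the pattern below. This is only an illustrative sketch of the Chronicle Queue appender/tailer model and its time-based roll cycle: the path and payload are invented, and Cassandra's own FQL/audit records are written in a binary form (hence fqltool and AuditLogViewer), not as plain text like this.

    import net.openhft.chronicle.queue.ChronicleQueue;
    import net.openhft.chronicle.queue.ExcerptAppender;
    import net.openhft.chronicle.queue.ExcerptTailer;
    import net.openhft.chronicle.queue.RollCycles;

    public final class ChronicleAppendTailSketch
    {
        public static void main(String[] args)
        {
            // Time-based roll cycle: a new queue file per hour. This is the time-driven
            // rolling Ariel mentions; disk usage follows traffic rather than a fixed cap.
            try (ChronicleQueue queue = ChronicleQueue.singleBuilder("/tmp/fql-sketch")
                                                      .rollCycle(RollCycles.HOURLY)
                                                      .build())
            {
                ExcerptAppender appender = queue.acquireAppender();
                appender.writeText("SELECT * FROM ks.tbl WHERE id = 1");   // append-only write

                ExcerptTailer tailer = queue.createTailer();               // sequential read-back ("tailing")
                for (String entry; (entry = tailer.readText()) != null; )
                    System.out.println(entry);
            }
        }
    }

Reading the files back requires the Chronicle libraries on the classpath, which is the consumption hurdle Andrew and the first message raise: the on-disk format is not something standard tools understand, so anyone archiving or analyzing these logs offline needs either fqltool/AuditLogViewer or their own Chronicle-based reader.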