Re: [DISCUSS] Chronicle Queue's development model and a hypothetical replacement of the library

Andrew Weaver Sun, 29 Sep 2024 11:45:49 -0700

I'm late to the discussion here, but I want to add my experience from
dealing with audit logs specifically.


Chronicle has some advantages (binary, compact) but it has a serious
disadvantage from a consumption standpoint: it's not a well-supported file
format. Audit logs are something that I think most operators are interested
in archiving for compliance purposes and analyzing offline for any number
of reasons, and an oddball file format is an unnecessary hurdle for the
audit log use case.

I would welcome support for an existing format that is compact,
high-performance and compatible with common tools (Spark, etc.).

On Sun, Sep 29, 2024, 10:11 AM Štefan Miklošovič <smikloso...@apache.org>
wrote:

> Thank you all for your answers and opinions. I would like to have some
> kind of a resolution here in order to move forward, especially with
> relation to CEP-12 I mentioned earlier. (1).
>
> I think we have these options:
>
> 1) Do nothing and wait until this gets back to us, probably in a more
> serious way (we find a bug and we will not be able to update it because it
> would be on "ea" or new features will be available only in newer versions).
>
> 2) Fork it and continue to maintain it - I do not think this is realistic,
> nobody is going to take care of forking that and maintaining it long term.
>
> 3) Do nothing but refactor it in such a way that it will be easier to
> replace it with something else in the future. CEP-12 is not only adding
> persistence to diagnostic events but the patch I have also makes whole
> logging more robust. Even though it is all on Chronicle Queues (FQL, Audit
> ...), there are some differences between them when it comes to the
> implementation, and I think that if we refactor it so that it has a clear
> class structure and hierarchy (bottom of CEP-12), we will have an easier job
> if we ever replace it.
>
> 4) Proceed with CEP-12 even though we know we are building it on top of
> something which should not be there.
>
> 5) Do absolutely nothing until we replace it with something else and we
> get rid of what is there right now - that would mean that we will not
> benefit from the code which is easier to maintain etc (if CEP-12 is not
> going to materialize) which I think is a welcome attribute of the code
> base to have.
>
> I was thinking more about stuff like protobuf and while I do see benefits
> of that, honestly, it just does not matter too much if it is done like that
> or not. I mean, sure, it would be cool to have, but we could spend a lot of
> effort on protobuf and integrating with it or on anything which would make
> the consumption of these events language-agnostic but these are quite niche
> scenarios and I think that time might be used somewhere else more
> effectively.
>
> The bottom line is that I am reluctant to do anything unless CEP-12 makes
> it in one way or another (either with diagnostic persistence or without it
> but with a nice refactoring) and, let's get real here, I do not think that
> anybody is going to spend any time on this particular piece of the
> functionality either. So the net result is that it will be either
> atrophying or we at least clean it up so whoever comes next has an easier
> job to replace it.
>
> (1)
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-12+Diagnostics+events+persistence+and+their+exposure+in+virtual+tables
>
> On Tue, Sep 24, 2024 at 6:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Hi,
>>
>> > I just don't understand what "good enough performance" is.
>>
>> Should really specify throughput. There is a single thread writing
>> records to the log and it's a bottleneck at around a few hundred thousand
>> entries/sec and 1 GB/sec. It doesn't scale to arbitrary throughput
>> requirements.
>>
>> > What is a "predictable footprint"? Was that measured too? How did we
>> > quantify that?
>>
>> You can set a rolling cycle to limit the size of the log. It's not that
>> predictable disk space wise because rolling is time based, and that is one
>> of the things I don't like about Chronicle.
>>
>> > This is interesting, if I understand correctly, the messages are weighted
>> > and the heavier they are, the more probable it is they will be dropped when
>> > it is overloaded? Or vice versa, the tighter ones are dropped first?
>>
>> It's still a FIFO queue. Elements aren't dropped from the queue; they are
>> dropped by the producers, who don't have to wait for the consumer of the
>> queue to catch up. The queue size is described in terms of weight, not
>> number of elements, so it can bound memory usage.
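
The producer-side behaviour described above can be sketched in Java as follows. This is a minimal illustration, not Cassandra's BinLog code: the class name WeightBoundedQueue, its methods, and the use of a byte count as the weight are all assumptions made for the example. It shows only the lossy mode, in which a producer whose record does not fit drops it and returns immediately instead of waiting for the consumer.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a FIFO queue bounded by total weight rather than element count. */
final class WeightBoundedQueue<T> {

    /** An element paired with its weight, e.g. its approximate size in bytes. */
    private static final class Weighted<E> {
        final E item;
        final long weight;

        Weighted(E item, long weight) {
            this.item = item;
            this.weight = weight;
        }
    }

    private final ConcurrentLinkedQueue<Weighted<T>> queue = new ConcurrentLinkedQueue<>();
    private final AtomicLong currentWeight = new AtomicLong();
    private final AtomicLong dropped = new AtomicLong();
    private final long maxWeight;

    WeightBoundedQueue(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    /**
     * Producer side: never blocks. If adding the item would exceed the weight
     * limit, the producer drops it, counts the drop, and returns false.
     */
    boolean offer(T item, long weight) {
        while (true) {
            long current = currentWeight.get();
            if (current + weight > maxWeight) {
                dropped.incrementAndGet();
                return false;
            }
            // Reserve the weight first so concurrent producers cannot overshoot the limit.
            if (currentWeight.compareAndSet(current, current + weight)) {
                queue.add(new Weighted<>(item, weight));
                return true;
            }
        }
    }

    /** Consumer side: removes the oldest element, or returns null if the queue is empty. */
    T poll() {
        Weighted<T> head = queue.poll();
        if (head == null) {
            return null;
        }
        currentWeight.addAndGet(-head.weight);
        return head.item;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

With a limit of 10, offering records of weight 4, 4, 8 and 2 accepts the first two, drops the third, and accepts the fourth. Records already in the queue are never evicted, and the heavy record is the one that fails to fit.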
>>
>> > Have we _ever_ experienced in production that some log events were really
>> > dropped? Has anybody ever hit that?
>>
>> Dropping samples is off by default so it can be used in a lossless way.
>>
>> Notionally one of the use cases of full query logging is that you have a
>> cluster that is overloaded and want to find out what is causing it. These
>> nodes may be low on IO/CPU and turning on the full query log could cause
>> additional timeouts so one goal of the full query log is that enabling it
>> shouldn't make things worse.
>>
>> That is the motivation for memory limits and not blocking request threads
>> on IO. Really there should also be rate limits and random sampling because
>> right now dropping samples will be biased towards dropping large footprint
>> samples.
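
One way to avoid that bias, sketched here under stated assumptions: decide whether to keep a record before looking at its size, so every record has the same chance of being logged. UniformSampler and keepProbability are hypothetical names for this example, and nothing like this is claimed to exist in Cassandra today.

```java
import java.util.Random;

/** Hypothetical sketch: keeps each record with a fixed probability, independent of its size. */
final class UniformSampler {
    private final double keepProbability;
    private final Random random;

    UniformSampler(double keepProbability, Random random) {
        if (keepProbability < 0.0 || keepProbability > 1.0) {
            throw new IllegalArgumentException("keepProbability must be in [0, 1]");
        }
        this.keepProbability = keepProbability;
        this.random = random;
    }

    /** Returns true if the next record should be logged; the record's weight plays no part. */
    boolean shouldKeep() {
        // nextDouble() is uniform in [0, 1), so 0.0 never keeps and 1.0 always keeps.
        return random.nextDouble() < keepProbability;
    }
}
```

A producer would call shouldKeep() first and only offer the record to the weighted queue when it returns true, which lowers the load on the queue without favouring small records.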
>>
>> David Capwell mentioned some performance issues. I recall we talked about
>> it and I did a quick microbenchmark and didn't have a problem writing
>> records (1 gigabyte/sec, hundreds of thousands of entries) so I am not sure
>> in which scenarios performance is bad and whether it is addressable.
>> Not sure it matters since Chronicle's approach to OSS is so problematic.
>>
>> Ariel
>>
>> On Tue, Sep 17, 2024, at 4:27 AM, Štefan Miklošovič wrote:
>>
>> to Benedict:
>>
>> well ... I was not around when the decision about the usage of Chronicle
>> Queues was made. I think that at that time it was the most obvious
>> candidate without reinventing the wheel given the features and capabilities
>> it had so taking something off the shelf was a natural conclusion.
>>
>> Josh / Jordan:
>>
>> not only FQL but Audit as well; these are two separate things. There is
>> also quite a "rich" ecosystem around that.
>>
>> 1) nodetool commands like
>>
>> enableauditlog
>> enablefullquerylog
>> disableauditlog
>> disablefullquerylog
>> getauditlog
>> getfullquerylog
>>
>> Also, because the files it produces are binary, we need special tooling
>> to inspect them. It lives in tools/fqltool with a bunch of classes, and
>> there is also an AuditLogViewer for reviewing audit logs.
>>
>> There are MBean methods enabling nodetool commands.
>>
>> We have also shipped that in two major releases (4.0 and now in 5.0) so
>> the community is quite well used to this, they have the processes set
>> around this etc.
>>
>> I mention this all because it is just not so easy to replace it with
>> something else if somebody wanted that, in any case. How do we even go
>> around deprecating this if we are indeed going to replace that?
>>
>> To discuss the release aspect they have in place: I think you are right
>> that the latest ea is as close as possible, if not the same, as what they
>> release privately. Yes. But if we want to stick to the rule that we upgrade
>> only to the latest ea release before their next minor, then
>>
>> 1) we will be always at least one minor late
>> 2) we do not know when they make up their minds to transition to a new
>> minor so we can upgrade to the latest ea one minor before
>> 3) if something is broken and we need to fix it and we are on ea, then
>> what we get to update to is the latest ea at that time which might fix the
>> issue but it will also bring new stuff in which might open doors to
>> instability as well. So we update to fix the bugs but we might include new
>> ones unknowingly.
>>
>> Anyway, I don't think this has any silver bullet solution, we might just
>> stick to the latest "ea" and be done with it. I do not expect this project
>> to evolve wildly and unpredictably, it just solves "one problem", there is
>> basically nothing new coming in.
>>
>> Brandon:
>>
>> I understand your concerns about phoning home but
>>
>> 1) we already resolved this by setting the respective property
>> 2) I do not think that Chronicle will mess with this now that they have
>> introduced it. There is nothing to "improve" or "change" there. It is
>> phoning home or not and it is driven by one property. If they made a change
>> so that we cannot turn it off then we would really be in trouble but for now
>> we are not and practically speaking I don't expect this would change.
>>
>> I know that this might sound like wishful thinking but in practical terms
>> I really just don't expect this phoning home thing would come back ever.
>>
>> Speaking of alternatives, I think the primary reason Chronicle was used
>> is this (1).
>>
>> "It's goal is good enough performance, predictable footprint, simplicity
>> in terms of implementation and configuration and most importantly minimal
>> impact on producers of log records."
>>
>> While I understand English (I guess, well enough :D), I just don't
>> understand what "good enough performance" is. How is this measured? What is
>> a "predictable footprint"? Was that measured too? How did we quantify that?
>>
>> " Performance safety is accomplished by feeding items to the binary log
>> using a weighted queue and dropping records if the binary log falls
>> sufficiently far behind."
>>
>> This is interesting, if I understand correctly, the messages are weighted
>> and the heavier they are, the more probable it is they will be dropped when
>> it is overloaded? Or vice versa, the tighter ones are dropped first?
>>
>> Have we _ever_ experienced in production that some log events were really
>> dropped? Has anybody ever hit that?
>>
>> When it comes to alternatives, what about logback + slf4j? It has
>> appenders where we want them, it is sync / async, we can code some nio
>> appender too I guess, and it logs as text into a file so we do not need
>> any special tooling to review that. For tailing, which Chronicle also
>> offers, I guess "tail -f that.log" just does the job? logback also rolls
>> the files after some configured period / size, the same way Chronicle
>> does (it even compresses the logs).
>>
>> Do we log so much that battle-tested logback is just absolutely not
>> enough for us? Come on, this is not rocket science that we need to use a
>> library from the realm of "high frequency trading" to just append queries
>> and audit logs as they are executed. logback can handle the load we have
>> just fine imo ...
>>
>> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>>
>> (1)
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
>>
>> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <dri...@gmail.com>
>> wrote:
>>
>> My concern is that we have to keep making sure it's not phoning home(1,2).
>>
>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>
>> Kind Regards,
>> Brandon
>>
>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <jmcken...@apache.org>
>> wrote:
>> >
>> > I think it's FQLTool only right now; I bumped into it recently doing
>> the JDK21 compat work.
>> >
>> > I'm not concerned about current usage / dependency, but if our usage
>> expands this could start to become a problem and that's going to be a hard
>> thing to track and manage.
>> >
>> > So reading through those issues Stefan, I think it boils down to:
>> >
>> > The latest ea is code identical to the stable release
>> > Subsequent bugfixes get applied to the customer-only stable branch and
>> one release forward
>> > Projects running ea releases would need to cherry-pick those bugfixes
>> back or run on the next branch's ea, which could introduce the project to
>> API changes or other risks
>> >
>> > Assuming that's the case... blech. Our exposure is low, but that seems
>> like a real pain.
>> >
>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>> >
>> >
>> > Don’t we essentially just use it as a file format for storing a couple
>> of kinds of append-only data?
>> >
>> > I was never entirely clear on the value it brought to the project.
>> >
>> >
>> > On 16 Sep 2024, at 22:11, Jordan West <jw...@apache.org> wrote:
>> >
>> > Thanks for the sleuthing Stefan! This definitely is a bit unfortunate.
>> It sounds like a replacement is not really practical so I'll ignore that
>> option for now, until a viable alternative is proposed. I am -1 on us
>> writing our own without strong, strong justification -- primarily because I
>> think the likelihood is we introduce more bugs before getting to something
>> stable.
>> >
>> > Regarding the remaining options, mostly some thoughts:
>> >
>> > - it would be nice to have some specific evidence of other projects
>> using the EA versions and what their developers have said about it.
>> > - it sounds like if we go with the EA route, the onus to test for
>> correctness / compatibility increases. They do test but anything marked
>> "early access" I think deserves more scrutiny from the C* community before
>> release. That could come in the form of more tests (or showing that we
>> already have good coverage of where it's used).
>> > - I assume each time we upgrade we would pick the most recently
>> released EA version
>> >
>> > Jordan
>> >
>> >
>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič <
>> smikloso...@apache.org> wrote:
>> >
>> > We are using a library called Chronicle Queue (1) and its dependencies
>> and we ship them in the distribution tarball.
>> >
>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you
>> look closely here (2), there is one more release like this, 2.23.37 and
>> after that all these releases have "ea" in their name.
>> >
>> > "ea" stands for "early access". The project has changed the versioning
>> / development model in such a way that "ea" releases act, more or less, as
>> glorified snapshots which are indeed released to Maven Central but the
>> "regular" releases are not there. The reason behind this is that "regular"
>> releases are published only for customers who pay the company behind
>> this project and they offer commercial support for that.
>> >
>> > "regular" releases are meant to get all the bug fixes after "ea" is
>> published and they are official stable releases. On the other hand "ea"
>> releases are the ones where the development happens and every now and then,
>> once the developers think that it is time to cut new 2.x, they just publish
>> that privately.
>> >
>> > I was investigating how this all works here (3) and while they said
>> that, I quote (4):
>> >
>> > "In my experience this is consumed by a large number of open source
>> projects reliably (for our other artifacts too). This development/ea branch
>> still goes through an extensive test suite prior to release. Releases from
>> this branch will contain the latest features and bug fixes."
>> >
>> > I am not completely sure if we are OK with this. For the record, Mick
>> is not overly comfortable with that and Brandon would prefer to just
>> replace it / get rid of this dependency (comments / reasons / discussion
>> from (5) to the end)
>> >
>> > The question is if we are OK with how things are and if we are then
>> what are the rules when upgrading the version of this project in Cassandra
>> in the context of "ea" versions they publish.
>> >
>> > If we are not OK with this, then the question is what we are going to
>> replace it with.
>> >
>> > If we are going to replace it, I very briefly took a look and there is
>> practically nothing out there which would hit all the buttons for us.
>> Chronicle is just perfect for this job and I am not a fan of rewriting this
>> at all.
>> >
>> > I would like to have this resolved because there is CEP-12 I plan to
>> deliver and I hit this and I do not want to base that work on something we
>> might eventually abandon. There are some ideas for CEP-12 how to bypass
>> this without using Chronicle but I would like to firstly hear your opinion.
>> >
>> > Regards
>> >
>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>> > (4)
>> https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>> > (5)
>> https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254
>> >
>> >
>>
>>
>>
