Strong +1 to the file format issue, and if we're building a wish list - it would be great if we could read the file format without pulling in cassandra-all. Long term, I'd love to see this for SSTables & Commit logs as well.
I've long been a fan of Gradle subprojects because it makes this kind of thing fairly easy.

Jon

On Sun, Sep 29, 2024 at 11:46 AM Andrew Weaver <andrewjwea...@gmail.com> wrote:

> I'm late to the discussion here, but I want to add my experience from dealing with audit logs specifically.
>
> Chronicle has some advantages (binary, compact) but it has a serious disadvantage from a consumption standpoint: it's not a well-supported file format. Audit logs are something that I think most operators are interested in archiving for compliance purposes and analyzing offline for any number of reasons, and an oddball file format is an unnecessary hurdle for the audit-log use case.
>
> I would welcome support for an existing format that is compact, high-performance and compatible with common tools (Spark, etc.).
>
> On Sun, Sep 29, 2024, 10:11 AM Štefan Miklošovič <smikloso...@apache.org> wrote:
>
>> Thank you all for your answers and opinions. I would like to have some kind of resolution here in order to move forward, especially in relation to CEP-12, which I mentioned earlier (1).
>>
>> I think we have these options:
>>
>> 1) Do nothing and wait until this gets back to us, probably in a more serious way (we find a bug and we will not be able to update because we would be on "ea", or new features will be available only in newer versions).
>>
>> 2) Fork it and continue to maintain it - I do not think this is realistic; nobody is going to take care of forking that and maintaining it long term.
>>
>> 3) Do nothing but refactor it in such a way that it will be easier to replace with something else in the future. CEP-12 is not only adding persistence for diagnostic events; the patch I have also makes the whole logging more robust. Even if it is all on Chronicle Queues (FQL, Audit ...), there are some differences in the implementations, and I think that if we refactor it so that it has a clear class structure and hierarchy (bottom of CEP-12), we will have an easier job if we ever go to replace it.
>>
>> 4) Proceed with CEP-12 even though we know we are building it on top of something which should not be there.
>>
>> 5) Do absolutely nothing until we replace it with something else and get rid of what is there right now - that would mean we will not benefit from code which is easier to maintain etc. (if CEP-12 is not going to materialize), which I think is a welcome attribute for the code base to have.
>>
>> I was thinking more about stuff like protobuf, and while I do see the benefits of that, honestly, it just does not matter too much whether it is done like that or not. I mean, sure, it would be cool to have, but we could spend a lot of effort on protobuf and integrating with it, or on anything which would make the consumption of these events language-agnostic, but these are quite niche scenarios and I think that time might be used more effectively elsewhere.
>>
>> The bottom line is that I am reluctant to do anything unless CEP-12 makes it in one way or another (either with diagnostic persistence or without it but with a nice refactoring) and, let's get real here, I do not think that anybody is going to spend any time on this particular piece of functionality either. So the net result is that it will either keep atrophying or we at least clean it up so whoever comes next has an easier job replacing it.
>>
>> (1) https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-12+Diagnostics+events+persistence+and+their+exposure+in+virtual+tables
>>
>> On Tue, Sep 24, 2024 at 6:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>>> Hi,
>>>
>>> I just don't understand what "good enough performance" is.
>>>
>>> Should really specify throughput. There is a single thread writing records to the log and it's a bottleneck at around a few hundred thousand entries/sec and 1 GB/sec. It doesn't scale to arbitrary throughput requirements.
>>>
>>> What is a "predictable footprint"? Was that measured too? How did we quantify that?
>>>
>>> You can set a rolling cycle to limit the size of the log. It's not that predictable disk-space-wise because rolling is time based, and that is one of the things I don't like about Chronicle.
>>>
>>> This is interesting, if I understand correctly, the messages are weighted and the heavier they are, the more probable it is they will be dropped when it is overloaded? Or vice versa, the lighter ones are dropped first?
>>>
>>> It's still a FIFO queue. Elements aren't dropped from the queue; they are dropped by the producers, who don't have to wait for the consumer of the queue to catch up. The queue size is described in terms of weight, not number of elements, so it can bound memory usage.
>>>
>>> Have we _ever_ experienced in production that some log events were really dropped? Has anybody ever hit that?
>>>
>>> Dropping samples is off by default so it can be used in a lossless way.
>>>
>>> Notionally, one of the use cases of full query logging is that you have a cluster that is overloaded and you want to find out what is causing it. These nodes may be low on IO/CPU, and turning on the full query log could cause additional timeouts, so one goal of the full query log is that enabling it shouldn't make things worse.
>>>
>>> That is the motivation for the memory limits and for not blocking request threads on IO. Really there should also be rate limits and random sampling, because right now dropping samples will be biased towards dropping large-footprint samples.
>>>
>>> David Capwell mentioned some performance issues. I recall we talked about it and I did a quick microbenchmark and didn't have a problem writing records (1 gigabyte/sec, hundreds of thousands of entries), so I am not sure what the scenario is where performance is bad and whether it is addressable. Not sure it matters since Chronicle's approach to OSS is so problematic.
>>>
>>> Ariel
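As an illustrative aside, the scheme Ariel describes (a FIFO queue bounded by total weight, where producers drop a record instead of waiting when the budget is exhausted) can be sketched roughly as below. The class and method names are hypothetical and this is not Cassandra's actual BinLog/WeightedQueue code; it only shows the idea.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Semaphore;
    import java.util.function.ToIntFunction;

    final class WeightBoundedLog<T>
    {
        private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
        private final Semaphore weightBudget;     // remaining weight (e.g. bytes) the queue may still hold
        private final ToIntFunction<T> weigher;   // how "heavy" a record is, e.g. its serialized size

        WeightBoundedLog(int maxWeight, ToIntFunction<T> weigher)
        {
            this.weightBudget = new Semaphore(maxWeight);
            this.weigher = weigher;
        }

        /** Called on request threads; returns false (record dropped) rather than ever blocking. */
        boolean offer(T record)
        {
            int weight = weigher.applyAsInt(record);
            if (!weightBudget.tryAcquire(weight))
                return false;                      // consumer has fallen too far behind: drop, don't wait
            queue.add(record);
            return true;
        }

        /** Called by the single consumer thread that appends records to the on-disk log. */
        T poll()
        {
            T record = queue.poll();
            if (record != null)
                weightBudget.release(weigher.applyAsInt(record));
            return record;
        }
    }

The bias Ariel mentions falls out naturally: a heavy record needs more permits, so under pressure it is more likely to fail tryAcquire and be dropped than a light one.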
>>> On Tue, Sep 17, 2024, at 4:27 AM, Štefan Miklošovič wrote:
>>>
>>> to Benedict:
>>>
>>> well ... I was not around when the decision about the usage of Chronicle Queues was made. I think that at that time it was the most obvious candidate without reinventing the wheel, given the features and capabilities it had, so taking something off the shelf was a natural conclusion.
>>>
>>> Josh / Jordan:
>>>
>>> not only FQL but Audit as well - these are two separate things. There is also quite a "rich" ecosystem around that.
>>>
>>> 1) nodetool commands like
>>>
>>> enableauditlog
>>> enablefullquerylog
>>> disableauditlog
>>> disablefullquerylog
>>> getauditlog
>>> getfullquerylog
>>>
>>> Also, because the files it produces are binary, we need special tooling to inspect them; it lives in tools/fqltool with a bunch of classes, and there is also an AuditLogViewer for reviewing audit logs.
>>>
>>> There are MBean methods enabling the nodetool commands.
>>>
>>> We have also shipped that in two major releases (4.0 and now 5.0), so the community is quite used to this; they have their processes set up around it etc.
>>>
>>> I mention all this because it is just not so easy to replace it with something else if somebody wanted that, in any case. How would we even go about deprecating this if we are indeed going to replace it?
>>>
>>> To discuss the release model they have in place: I think you are right that the latest ea is as close as possible, if not identical, to what they release privately. Yes. But if we want to stick to the rule that we upgrade only to the latest ea release before their next minor, then
>>>
>>> 1) we will always be at least one minor late
>>> 2) we do not know when they will decide to transition to a new minor, so we cannot predict when we can upgrade to the last ea of the previous minor
>>> 3) if something is broken and we need to fix it and we are on ea, then what we get to update to is the latest ea at that time, which might fix the issue but will also bring in new stuff, which might open the door to instability as well. So we update to fix bugs but we might pull in new ones unknowingly.
>>>
>>> Anyway, I don't think there is any silver bullet here; we might just stick to the latest "ea" and be done with it. I do not expect this project to evolve wildly and unpredictably, it just solves "one problem", there is basically nothing new coming in.
>>>
>>> Brandon:
>>>
>>> I understand your concerns about phoning home, but
>>>
>>> 1) we already resolved this by setting the respective property
>>> 2) I do not think that Chronicle will mess with this once they introduce it. There is nothing to "improve" or "change" there. It either phones home or it does not, and it is driven by one property. If they made a change so that we could not turn it off then we would really be in trouble, but for now we are not, and practically speaking I don't expect this to change.
>>>
>>> I know that this might sound like wishful thinking, but in practical terms I really just don't expect this phoning-home thing to ever come back.
>>>
>>> Speaking of alternatives, I think the primary reason Chronicle was used is this (1):
>>>
>>> "It's goal is good enough performance, predictable footprint, simplicity in terms of implementation and configuration and most importantly minimal impact on producers of log records."
>>>
>>> While I understand English (I guess, well enough :D), I just don't understand what "good enough performance" is. How is this measured? What is a "predictable footprint"? Was that measured too? How did we quantify that?
>>>
>>> "Performance safety is accomplished by feeding items to the binary log using a weighted queue and dropping records if the binary log falls sufficiently far behind."
>>>
>>> This is interesting, if I understand correctly, the messages are weighted and the heavier they are, the more probable it is they will be dropped when it is overloaded? Or vice versa, the lighter ones are dropped first?
>>>
>>> Have we _ever_ experienced in production that some log events were really dropped? Has anybody ever hit that?
>>>
>>> When it comes to alternatives, what about logback + slf4j? It has the appenders we want, it is sync / async, we could code some nio appender too I guess, and it logs as text into a file so we do not need any special tooling to review it. For tailing, which Chronicle also offers, I guess "tail -f that.log" just does the job? logback even rolls the files after they get big enough, so it rolls the files the same way after some configured period / size as Chronicle does (it even compresses the logs).
>>>
>>> Do we log so much that battle-tested logback is just absolutely not enough for us? Come on, this is not rocket science; do we really need a library from the realm of "high frequency trading" just to append queries and audit logs as they are executed? logback can handle the load we have just fine imo ...
>>>
>>> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>>>
>>> (1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
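To make the logback + slf4j idea above concrete, a rough programmatic sketch of a dedicated audit logger follows. It is only an illustration under assumed names (the "cassandra.audit" logger name, file paths, sizes and retention are invented, and this is not a proposal for the actual wiring); it simply shows that size/time-based rolling, compression of rolled files, and a non-blocking async hand-off are stock logback features.

    import ch.qos.logback.classic.AsyncAppender;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.rolling.RollingFileAppender;
    import ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy;
    import ch.qos.logback.core.util.FileSize;
    import org.slf4j.LoggerFactory;

    public final class AuditLogbackSketch
    {
        public static void main(String[] args)
        {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            PatternLayoutEncoder encoder = new PatternLayoutEncoder();
            encoder.setContext(ctx);
            encoder.setPattern("%d{ISO8601}|%msg%n");
            encoder.start();

            RollingFileAppender<ILoggingEvent> file = new RollingFileAppender<>();
            file.setContext(ctx);
            file.setFile("audit/audit.log");
            file.setEncoder(encoder);

            // roll by time and size, gzip rolled files, cap retained history
            SizeAndTimeBasedRollingPolicy<ILoggingEvent> policy = new SizeAndTimeBasedRollingPolicy<>();
            policy.setContext(ctx);
            policy.setParent(file);
            policy.setFileNamePattern("audit/audit-%d{yyyy-MM-dd}.%i.log.gz");
            policy.setMaxFileSize(FileSize.valueOf("256MB"));
            policy.setMaxHistory(7);
            policy.start();

            file.setRollingPolicy(policy);
            file.start();

            // async wrapper so request threads hand off and never block on disk IO
            AsyncAppender async = new AsyncAppender();
            async.setContext(ctx);
            async.setQueueSize(1024);
            async.setNeverBlock(true);   // drop rather than stall when the queue is full
            async.addAppender(file);
            async.start();

            Logger audit = ctx.getLogger("cassandra.audit");   // hypothetical logger name
            audit.setAdditive(false);
            audit.addAppender(async);

            // one record per line, readable with tail/grep, no special tooling needed
            audit.info("user=alice|ks=test|stmt=SELECT ...");
        }
    }

setNeverBlock(true) gives a crude analogue of Chronicle's "drop instead of stalling request threads" behaviour, although the async queue is bounded by element count rather than by weight.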
>>> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <dri...@gmail.com> wrote:
>>>
>>> My concern is that we have to keep making sure it's not phoning home (1, 2).
>>>
>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>> >
>>> > I think it's FQLTool only right now; I bumped into it recently doing the JDK21 compat work.
>>> >
>>> > I'm not concerned about current usage / dependency, but if our usage expands this could start to become a problem, and that's going to be a hard thing to track and manage.
>>> >
>>> > So reading through those issues Stefan, I think it boils down to:
>>> >
>>> > - The latest ea is code-identical to the stable release
>>> > - Subsequent bugfixes get applied to the customer-only stable branch and one release forward
>>> > - Projects running ea releases would need to cherry-pick those bugfixes back or run on the next branch's ea, which could expose the project to API changes or other risks
>>> >
>>> > Assuming that's the case... blech. Our exposure is low, but that seems like a real pain.
>>> >
>>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>>> >
>>> > Don’t we essentially just use it as a file format for storing a couple of kinds of append-only data?
>>> >
>>> > I was never entirely clear on the value it brought to the project.
>>> >
>>> > On 16 Sep 2024, at 22:11, Jordan West <jw...@apache.org> wrote:
>>> >
>>> > Thanks for the sleuthing Stefan! This definitely is a bit unfortunate. It sounds like a replacement is not really practical, so I'll ignore that option for now, until a viable alternative is proposed. I am -1 on us writing our own without strong, strong justification -- primarily because I think the likelihood is we introduce more bugs before getting to something stable.
>>> >
>>> > Regarding the remaining options, some thoughts:
>>> >
>>> > - it would be nice to have some specific evidence of other projects using the EA versions and what their developers have said about it.
>>> > - it sounds like if we go with the EA route, the onus to test for correctness / compatibility increases. They do test, but anything marked "early access" I think deserves more scrutiny from the C* community before release. That could come in the form of more tests (or showing that we already have good coverage of where it's used).
>>> > - I assume each time we upgrade we would pick the most recently released EA version
>>> >
>>> > Jordan
>>> >
>>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>> >
>>> > We are using a library called Chronicle Queue (1) and its dependencies, and we ship them in the distribution tarball.
>>> >
>>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you look closely here (2), there is one more release like this, 2.23.37, and after that all the releases have "ea" in their name.
>>> >
>>> > "ea" stands for "early access". The project has changed its versioning / development model in such a way that "ea" releases act, more or less, as glorified snapshots which are indeed released to Maven Central, while the "regular" releases are not there. The reason behind this is that "regular" releases are published only for customers who pay the company behind this project, which offers commercial support for them.
>>> >
>>> > "regular" releases are meant to get all the bug fixes after an "ea" is published and they are the official stable releases. On the other hand, "ea" releases are the ones where the development happens, and every now and then, once the developers think that it is time to cut a new 2.x, they just publish that privately.
>>> >
>>> > I was investigating how this all works here (3), and they said, I quote (4):
>>> >
>>> > "In my experience this is consumed by a large number of open source projects reliably (for our other artifacts too). This development/ea branch still goes through an extensive test suite prior to release. Releases from this branch will contain the latest features and bug fixes."
>>> >
>>> > I am not completely sure if we are OK with this. For the record, Mick is not overly comfortable with it, and Brandon would prefer to just replace it / get rid of this dependency (comments / reasons / discussion from (5) to the end).
>>> >
>>> > The question is whether we are OK with how things are, and if we are, what the rules are for upgrading the version of this project in Cassandra in the context of the "ea" versions they publish.
>>> >
>>> > If we are not OK with this, then the question is what we are going to replace it with.
>>> >
>>> > If we are going to replace it, I very briefly took a look and there is practically nothing out there which would hit all the buttons for us. Chronicle is just perfect for this job and I am not a fan of rewriting this at all.
>>> >
>>> > I would like to have this resolved because there is CEP-12 which I plan to deliver, and I hit this, and I do not want to base that work on something we might eventually abandon. There are some ideas in CEP-12 for how to bypass this without using Chronicle, but I would like to hear your opinion first.
>>> >
>>> > Regards
>>> >
>>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>>> > (4) https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>>> > (5) https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254
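For reference, the append-only usage Benedict asks about earlier in the thread boils down to roughly the pattern below. This is only an illustrative sketch of the Chronicle Queue appender/tailer model and its time-based roll cycle: the path and payload are invented, and Cassandra's own FQL/audit records are written in a binary form (hence fqltool and AuditLogViewer), not as plain text like this.

    import net.openhft.chronicle.queue.ChronicleQueue;
    import net.openhft.chronicle.queue.ExcerptAppender;
    import net.openhft.chronicle.queue.ExcerptTailer;
    import net.openhft.chronicle.queue.RollCycles;

    public final class ChronicleAppendTailSketch
    {
        public static void main(String[] args)
        {
            // Time-based roll cycle: a new queue file per hour. This is the time-driven
            // rolling Ariel mentions; disk usage follows traffic rather than a fixed cap.
            try (ChronicleQueue queue = ChronicleQueue.singleBuilder("/tmp/fql-sketch")
                                                      .rollCycle(RollCycles.HOURLY)
                                                      .build())
            {
                ExcerptAppender appender = queue.acquireAppender();
                appender.writeText("SELECT * FROM ks.tbl WHERE id = 1");   // append-only write

                ExcerptTailer tailer = queue.createTailer();               // sequential read-back ("tailing")
                for (String entry; (entry = tailer.readText()) != null; )
                    System.out.println(entry);
            }
        }
    }

Reading the files back requires the Chronicle libraries on the classpath, which is the consumption hurdle Andrew and the first message raise: the on-disk format is not something standard tools understand, so anyone archiving or analyzing these logs offline needs either fqltool/AuditLogViewer or their own Chronicle-based reader.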