It might be a fun experiment to retrofit the Harry tests we're currently using (once Harry 2.0 lands in trunk from cep-15-accord) to fuzz SAI and point them at legacy 2i (i.e. the subset of query types legacy 2i supports) and see if we find anything interesting, but I don't even know if that rises above something like CASSANDRA-19007 <https://issues.apache.org/jira/browse/CASSANDRA-19007> on the backlog of things around indexes/filtering I would fix...
On Tue, Dec 10, 2024 at 10:35 AM Benedict Elliott Smith <bened...@apache.org> wrote:

> I agree with Aleksey on how we should approach feature flags, and if we think 2i simply *don’t work* we should make that determination and mark them *broken* not *deprecated*.
>
> The only bug mentioned so far is 18656, which doesn’t clearly argue that the behaviour is *incorrect* rather than just undesired. The only breaking scenario I can think of is if we complete a bootstrap before the index build is complete. I am not sure if this is possible, but if it is we should probably fix that, and in the meantime perhaps document the flaw and describe workarounds (such as repairing after stopping a replica to be replaced). This isn’t a “remove the feature” level bug though, given my current understanding of it. If anything, it would be much more work than just fixing the bug.
>
> If there’s a longer litany of breaking behaviours, let’s enumerate them and consider marking the feature as unsafe.
>
> On 10 Dec 2024, at 10:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>
> I think my point here is that the hidden table 2i implementation has known correctness/availability/operational/resource usage issues whether it has a theoretical niche use-case or not from a query performance perspective.
>
> To Štefan’s question, yes, more or less. I’d like to at least see some success in production for the cases it was primarily designed for. That might not be enough to make it the default if it needs to perform better than the (broken) legacy 2i in global query situations.
> SAI is currently bad by design for global queries across 1000s of SSTables (LCS), so it would either need to be used in conjunction with a compaction strategy that aggressively limits the number of live SSTables, otherwise modified to handle that case better, or simply made the default w/ the guardrails it already has around these things because there simply isn’t a usable alternative.
>
> On Dec 10, 2024, at 9:13 AM, Benedict Elliott Smith <bened...@apache.org> wrote:
>
> There is no reason it should ever be more capable than SAI for any partition/token-restricted query use-case, and I don't really see how there's any short-term path for any local 2i implementation in C* to be efficient for anything else
>
> While I am not personally aware of much evidence presented that SAI performs better than 2i for the partition-restricted case, I do believe it is theoretically likely to. But any deprecation discussion should include evidence of this as a preamble.
>
> However, there are users that want queries *not* restricted by partition or token, and SAI is unlikely to serve these use cases as well. Yes, neither perform this use case *well*, but I cannot support deprecating a feature when its replacement is very likely inferior for some workloads. Since it is hard to prove that nobody is using 2i this way (and I recall from the distant past that such users were known to exist), we need instead to prove SAI can serve these workloads acceptably before we declare it a suitable replacement.
>
> I think there exists a near future world where we can offer proper *global* secondary indexes, at which point it would be acceptable to deprecate 2i and recommend users switch to either global secondary indexes or SAI. Until then, I cannot see a good argument for it if we want to be considered a stable and mature product.
>
> On 10 Dec 2024, at 09:28, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>
> I’m not convinced SAI has demonstrated a practical or theoretical capability to fully replace secondary indexes anyway. So it would be very premature to mark them deprecated.
>
> If 2i indexes are to be marked as deprecated and SAI is beta, then what is actually the index implementation we stand behind in the production? It is like we are "abandoning" the former but the latter is not bullet-proof yet.
>
> The table-based 2i implementation has never been safe to use, and I don't think it ever will be, however we label it. (ex. CASSANDRA-18656, its on-disk bloat, post-streaming rebuilds, etc.) There is no reason it should ever be more capable than SAI for any partition/token-restricted query use-case, and I don't really see how there's any short-term path for any local 2i implementation in C* to be efficient for anything else. There are presently no feature gaps on the query side.
>
> Anyway, there are still a lot of things we can improve about SAI (and things that already exist and are just waiting in the DS public fork)...I'm just not sure what reasonable use case the old 2i will be able to serve better.
>
> On Tue, Dec 10, 2024 at 5:41 AM Benedict <bened...@apache.org> wrote:
>
>> I’m not convinced SAI has demonstrated a practical or theoretical capability to fully replace secondary indexes anyway. So it would be very premature to mark them deprecated.
>>
>> On 10 Dec 2024, at 06:29, Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>> ... then we should NOT mark it to be deprecated.
>>
>> On Tue, Dec 10, 2024 at 12:27 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
>>
>>> I have a hard time getting used to the "terminology" here. If 2i indexes are to be marked as deprecated and SAI is beta, then what is actually the index implementation we stand behind in the production?
>>> It is like we are "abandoning" the former but the latter is not bullet-proof yet. The signal it sends is that we don't have a non-deprecated bullet-proof index impl.
>>>
>>> Maybe it is just about the wording and people are just fine running deprecated things knowing they are production-ready, what I am used to is that if something is deprecated, then there is always a replacement which is recommended. If there isn't a recommended replacement which can fully supersede the current implementation then we should mark it to be deprecated.
>>>
>>> I understand that you are trying to find some "common ground" / expressing that we are moving towards SAI but I am not sure the wording is entirely correct or we should be careful how we frame it.
>>>
>>> On Tue, Dec 10, 2024 at 12:01 PM Mick Semb Wever <m...@apache.org> wrote:
>>>
>>>> > A possibility with SAI is to mark it beta while also marking 2i as deprecated (and leaving SASI as marked). This sends a clear signal (imho) that SAI is the recommended solution forward but also being honest about its maturity and QA.
>>>>
>>>> (and leaving SASI as marked *experimental*)