Hi, Thanks Philip.
For items in went poorly, is there anything we could change or should have done differently to get contributor branches running in CI? I don’t recall us setting a fixed goal for when that had to be done.

There are two angles from which to evaluate this and decide if there is a problem: either we had a schedule goal and didn’t make it, or we didn’t have a schedule goal but committed resources and didn’t make progress. If it's just not done and we are OK with that, it wouldn't go under went poorly (it's just how it went).

*Retrospecting the retrospective*

The retrospective didn’t play out the way I hoped, and I think it may be because I didn’t communicate in enough detail how we are supposed to use it. For everyone to have nothing to add in the went poorly column means that, in the period covered by the retrospective, we shipped no fixes for bugs that we think should have been caught before release. We know that isn’t true because in JIRA there were 114 resolved bugs for 2.1.3. Some of those are going to be things that were addressed before being released, but a good chunk have to be fixes for issues in previous releases.

The reason we need to do this through a retrospective is that doing it myself is a lot of work and I lack perspective on what is going on with each individual issue. Having a PHB chase everyone down and find out what they are doing is inefficient and doesn’t work, because it lacks repeatability/scalability and it builds experience for me but not for the team.

*Evaluating 2.1.3 issues:*

Right now we have what you might call a target rich environment. Many features aren’t exercised in a realistic way or evaluated by success criteria (performance, space utilization) that users care about. The first few retrospectives should be the noisiest as we work through that backlog.

We may be creating regression tests as part of the bug fixes we shipped (I sure hope we did), but regression tests are not the same as the kind of tests we could have written to avoid shipping the bug in the first place: tests which have the potential to catch more than just the individual bug in question.

I’m going to drag fixes from 2.1.3 into the went poorly category, and I would like to have the involved parties (fundamentally this is the responsibility of the assignee) chime in on why the bug was released in the first place and what kind of test we could have done before release to catch it. I removed a number of issues that were enhancements, bugs fixed before release due to successful process, or wontfixed/not reproducible.

Reasons for revisiting are typically a missing regression test, a missing test that we could run now to detect this class of problem in the future, and most importantly (and also hardest to nail down), what we could do in the future when doing that kind of work to build effective tests before release.

One of the things I pick on for some of these is inadequate testing of boundary conditions and inadequate testing with interrelated components (always hard to identify and can change over time). Not testing under sufficient load or with a representative data model or data set is also an issue for some of these.

*Homework:*

If you are listed as an assignee you need to triage the ticket. Based on the experience with that bug, are we doing sufficient testing now, and what kind of testing could have been done before release to find the issue without the benefit of hindsight? Every issue needs a response even if the response is "no work to be done." If there is work to be done it has to find its way into our testing strategy (submit a JIRA, or bring it up here).

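As an illustration of the boundary-condition testing mentioned above, here is a minimal sketch in Python (the language the dtests use). It is not Cassandra code: `validate_indexed_value`, `ValueTooLargeError`, and the 0xFFFF limit are hypothetical stand-ins chosen for the example. The point is only that probing both sides of a limit tends to catch a class of bugs rather than one specific bug.

```python
# Hypothetical size limit, chosen only for illustration (64K - 1 bytes).
MAX_INDEXED_VALUE_BYTES = 0xFFFF


class ValueTooLargeError(ValueError):
    """Raised when a value exceeds the (assumed) size limit."""


def validate_indexed_value(value: bytes) -> bytes:
    """Hypothetical stand-in for the code path under test."""
    if len(value) > MAX_INDEXED_VALUE_BYTES:
        raise ValueTooLargeError(
            f"value of {len(value)} bytes exceeds {MAX_INDEXED_VALUE_BYTES}"
        )
    return value


def boundary_sizes(limit: int) -> list:
    """Sizes worth probing: empty, minimal, and either side of the limit."""
    return [0, 1, limit - 1, limit, limit + 1]


def run_boundary_checks(validate, limit: int) -> dict:
    """Map each probed size to 'accepted' or 'rejected'."""
    results = {}
    for size in boundary_sizes(limit):
        try:
            validate(b"x" * size)
            results[size] = "accepted"
        except ValueTooLargeError:
            # A clean, typed rejection is the expected failure mode;
            # any other exception propagates and fails the check.
            results[size] = "rejected"
    return results


if __name__ == "__main__":
    outcomes = run_boundary_checks(validate_indexed_value, MAX_INDEXED_VALUE_BYTES)
    for size, outcome in outcomes.items():
        print(size, outcome)
```

A real test would swap the stand-in validator for each access path that accepts the value (for example, every way of inserting into an indexed column) and run the same set of sizes against each.
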
*Went poorly:*

*Key* | *Assignee* | *Summary* | *Revisit reason*
CASSANDRA-7538 <https://issues.apache.org/jira/browse/CASSANDRA-7538> | Sam Tunnicliffe | Truncate of a CF should also delete Paxos CF | Truncate not tested with PAXOS, what else?
CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704> | Benedict | FileNotFoundException during STREAM-OUT triggers 100% CPU usage | Streaming testing didn't reproduce this before release
CASSANDRA-7801 <https://issues.apache.org/jira/browse/CASSANDRA-7801> | Sylvain Lebresne | A successful INSERT with CAS does not always store data in the DB after a DELETE | Multiple access paths for data not tested together
CASSANDRA-7910 <https://issues.apache.org/jira/browse/CASSANDRA-7910> | Tyler Hobbs | wildcard prepared statements are incorrect after a column is added to the table | Alter table not tested concurrently with ?
CASSANDRA-8018 <https://issues.apache.org/jira/browse/CASSANDRA-8018> | Benjamin Lerer | Cassandra seems to insert twice in custom PerColumnSecondaryIndex | Custom secondary indexes not tested before release?
CASSANDRA-8028 <https://issues.apache.org/jira/browse/CASSANDRA-8028> | Carl Yeksigian | Unable to compute when histogram overflowed | Histogram output not tested with representative data sets, no regression test
CASSANDRA-8122 <https://issues.apache.org/jira/browse/CASSANDRA-8122> | Carl Yeksigian | Undeclare throwable exception while executing 'nodetool netstats localhost' | nodetool not tested against cluster throughout lifecycle, no regression test
CASSANDRA-8211 <https://issues.apache.org/jira/browse/CASSANDRA-8211> | Marcus Eriksson | Overlapping sstables in L1+ | Noted as hard to reproduce, but is there still a way we could have? No regression test
CASSANDRA-8231 <https://issues.apache.org/jira/browse/CASSANDRA-8231> | Benjamin Lerer | Wrong size of cached prepared statements | Expected cache capacity not validated against actual cache capacity, no regression test
CASSANDRA-8243 <https://issues.apache.org/jira/browse/CASSANDRA-8243> | Björn Hegerfors | DTCS can leave time-overlaps, limiting ability to expire entire SSTables | Performance improving fast path not tested in a representative way
CASSANDRA-8264 <https://issues.apache.org/jira/browse/CASSANDRA-8264> | Tyler Hobbs | Problems with multicolumn relations and COMPACT STORAGE | How can we catch interactions like compact storage not being covered by the test?
CASSANDRA-8280 <https://issues.apache.org/jira/browse/CASSANDRA-8280> | Sam Tunnicliffe | Cassandra crashing on inserting data over 64K into indexed strings | Added tests are a good example. Could focusing on testing all access paths and boundary conditions per access path have prevented this?
CASSANDRA-8285 <https://issues.apache.org/jira/browse/CASSANDRA-8285> | Aleksey Yeschenko | Move all hints related tasks to hints private executor | Pierre's reproducer represents something we weren't doing, but that users are. Is that now being tested?
CASSANDRA-8286 <https://issues.apache.org/jira/browse/CASSANDRA-8286> | Tyler Hobbs | Regression in ORDER BY | There were tests that failed in some versions, but not all? Did this not ship?
CASSANDRA-8288 <https://issues.apache.org/jira/browse/CASSANDRA-8288> | Tyler Hobbs | cqlsh describe needs to show 'sstable_compression': '' | Roundtrip test for describe schema?
CASSANDRA-8292 <https://issues.apache.org/jira/browse/CASSANDRA-8292> | Joshua McKenzie | From Pig: org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. | PIG not tested
CASSANDRA-8302 <https://issues.apache.org/jira/browse/CASSANDRA-8302> | Tyler Hobbs | Filtering for CONTAINS (KEY) on frozen collection clustering columns within a partition does not work | More untested combinations, could we have spotted that there was an interaction and tested it? Or did this not ship?
CASSANDRA-8316 <https://issues.apache.org/jira/browse/CASSANDRA-8316> | Marcus Eriksson | "Did not get positive replies from all endpoints" error on incremental repair | What were users doing differently, is there a reproducer for this running now?
CASSANDRA-8320 <https://issues.apache.org/jira/browse/CASSANDRA-8320> | Marcus Eriksson | 2.1.2: NullPointerException in SSTableWriter | What were users doing that caused this, are we doing that?
CASSANDRA-8332 <https://issues.apache.org/jira/browse/CASSANDRA-8332> | T Jake Luciani | Null pointer after droping keyspace | Add/drop keyspace not tested under load, with server logs checked for errors
CASSANDRA-8365 <https://issues.apache.org/jira/browse/CASSANDRA-8365> | Benjamin Lerer | CamelCase name is used as index name instead of lowercase | How can we establish UI consistency?
CASSANDRA-8370 <https://issues.apache.org/jira/browse/CASSANDRA-8370> | Sam Tunnicliffe | cqlsh doesn't handle LIST statements correctly | cqlsh untested functionality, no regression test?
CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383> | Benedict | Memtable flush may expire records from the commit log that are in a later memtable | No regression test, no follow up ticket. Could/should this have been reproducible as an actual bug?
CASSANDRA-8386 <https://issues.apache.org/jira/browse/CASSANDRA-8386> | Marcus Eriksson | Make sure we release references to sstables after incremental repair | Is there a higher level test that could have observed this failure?
CASSANDRA-8401 <https://issues.apache.org/jira/browse/CASSANDRA-8401> | Jonathan Ellis | dropping a CF doesn't remove the latency-sampling task | Another argument for a schema change stress test, maybe tracking for constant memory utilization
CASSANDRA-8408 <https://issues.apache.org/jira/browse/CASSANDRA-8408> | Tyler Hobbs | limit appears to replace page size under certain conditions | No test that validates that paging returns the expected number of results? Another of the genre of queries we support but don't test all the combinations
CASSANDRA-8410 <https://issues.apache.org/jira/browse/CASSANDRA-8410> | Tyler Hobbs | Select with many IN values on clustering columns can result in a StackOverflowError | Another missing boundary conditions test, test maximum size IN clause against *
CASSANDRA-8421 <https://issues.apache.org/jira/browse/CASSANDRA-8421> | Benjamin Lerer | Cassandra 2.1.1 & Cassandra 2.1.2 UDT not returning value for LIST type as UDT | Is there a test that could have found this condition before release?
CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429> | Benedict | Some keys unreadable during compaction | Running stress in CI would have caught this, and we're going to do that
CASSANDRA-8432 <https://issues.apache.org/jira/browse/CASSANDRA-8432> | Marcus Eriksson | Standalone Scrubber broken for LCS | Standalone scrubber not tested, no regression test
CASSANDRA-8448 <https://issues.apache.org/jira/browse/CASSANDRA-8448> | Brandon Williams | "Comparison method violates its general contract" in AbstractEndpointSnitch | This just happens periodically? Was the snitch not tested under load and the log output checked for errors?
CASSANDRA-8451 <https://issues.apache.org/jira/browse/CASSANDRA-8451> | Tyler Hobbs | NPE when writetime() or ttl() are nested inside function call | Is this testable? Can we check that functions compose correctly or validate that they are inherently composable. No regression test.
CASSANDRA-8458 <https://issues.apache.org/jira/browse/CASSANDRA-8458> | Marcus Eriksson | Don't give out positions in an sstable beyond its first/last tokens | Streaming not done in realistic scenario with validation of logging
CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459> | Benedict | "autocompaction" on reads can prevent memtable space reclaimation | What would have reproduced this before release?
CASSANDRA-8462 <https://issues.apache.org/jira/browse/CASSANDRA-8462> | Aleksey Yeschenko | Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes | Have additional dtest coverage, need to do this in kitchen sink tests
CASSANDRA-8463 <https://issues.apache.org/jira/browse/CASSANDRA-8463> | Marcus Eriksson | Constant compaction under LCS | What would have reproduced this before release?
CASSANDRA-8490 <https://issues.apache.org/jira/browse/CASSANDRA-8490> | Tyler Hobbs | DISTINCT queries with LIMITs or paging are incorrect when partitions are deleted | Untested query forms, no regression test
CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499> | Benedict | Ensure SSTableWriter cleans up properly after failure | Testing error paths? Any way to test things in a loop to detect leaks?
CASSANDRA-8510 <https://issues.apache.org/jira/browse/CASSANDRA-8510> | Marcus Eriksson | CompactionManager.submitMaximal may leak resources | Not a user visible problem, so difficult to catch in test, but is there a way?
CASSANDRA-8512 <https://issues.apache.org/jira/browse/CASSANDRA-8512> | Tyler Hobbs | cqlsh unusable after encountering schema mismatch | cqlsh not tested with other functionality active
CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513> | Benedict | SSTableScanner may not acquire reference, but will still release it when closed | This had a user visible component, what test could have caught it before release?
CASSANDRA-8514 <https://issues.apache.org/jira/browse/CASSANDRA-8514> | Benjamin Lerer | ArrayIndexOutOfBoundsException in nodetool cfhistograms | Not released, but not caught by automated tests either
CASSANDRA-8525 <https://issues.apache.org/jira/browse/CASSANDRA-8525> | Marcus Eriksson | Bloom Filter truePositive counter not updated on key cache hit | User visible metric not accurate, but only in one config. Possible to guess correct FP ratio and validate while exploring config space?
CASSANDRA-8532 <https://issues.apache.org/jira/browse/CASSANDRA-8532> | Marcus Eriksson | Fix calculation of expected write size during compaction | Did this manifest as a user visible issue, could we have tested for that?
CASSANDRA-8537 <https://issues.apache.org/jira/browse/CASSANDRA-8537> | Marcus Eriksson | ConcurrentModificationException while executing 'nodetool cleanup' | Nodetool cleanup not tested before release
CASSANDRA-8550 <https://issues.apache.org/jira/browse/CASSANDRA-8550> | Tyler Hobbs | Internal pagination in CQL3 index queries creating substantial overhead | Pagination not performance tested with representative data models
CASSANDRA-8558 <https://issues.apache.org/jira/browse/CASSANDRA-8558> | Sylvain Lebresne | deleted row still can be selected out | Validate that deleted data stays deleted under * conditions (big matrix of interactions here with different configurations, streaming, repair, cleanup, scrub). Deleted data coming back shows up a lot.
CASSANDRA-8562 <https://issues.apache.org/jira/browse/CASSANDRA-8562> | Marcus Eriksson | Fix checking available disk space before compaction starts | Is there a user visible negative impact, could it have been tested for?
CASSANDRA-8563 <https://issues.apache.org/jira/browse/CASSANDRA-8563> | Tyler Hobbs | cqlsh broken for some thrift created tables. | Validate mixed CQL thrift interactions? Possibly abstract everything to be done either by CQL or Thrift and then permute? Seems low value, but necessary if both are claimed to be supported.
CASSANDRA-8577 <https://issues.apache.org/jira/browse/CASSANDRA-8577> | Artem Aliev | Values of set types not loading correctly into Pig | Full set of interactions with PIG not validated
CASSANDRA-8579 <https://issues.apache.org/jira/browse/CASSANDRA-8579> | Jimmy Mårdell | sstablemetadata can't load org.apache.cassandra.tools.SSTableMetadataViewer | Running C* from source tree not representative of behavior of deployed builds
CASSANDRA-8580 <https://issues.apache.org/jira/browse/CASSANDRA-8580> | Marcus Eriksson | AssertionErrors after activating unchecked_tombstone_compaction with leveled compaction | How could this have been reproduced before release? No regression test
CASSANDRA-8588 <https://issues.apache.org/jira/browse/CASSANDRA-8588> | Dave Brosius | Fix DropTypeStatements isusedBy for maps (typo ignored values) | Not released, but was it detected before release by an automated test?
CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619> | Benedict | using CQLSSTableWriter gives ConcurrentModificationException | What kind of test would have caught this before release?
CASSANDRA-8623 <https://issues.apache.org/jira/browse/CASSANDRA-8623> | Marcus Eriksson | sstablesplit fails *randomly* with Data component is missing | Feature not tested before release? No regression test
CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632> | Benedict | cassandra-stress only generating a single unique row | We rely on stress for performance testing, that might mean it needs real testing that demonstrates it generates load that looks like the load it is supposed to be generating.
CASSANDRA-8635 <https://issues.apache.org/jira/browse/CASSANDRA-8635> | Marcus Eriksson | STCS cold sstable omission does not handle overwrites without reads | If this workload is a challenge for certain kinds of optimizations we should test it if we think it could happen again.
CASSANDRA-8640 <https://issues.apache.org/jira/browse/CASSANDRA-8640> | Anthony Cozzie | Paxos requires all nodes for CAS | If PAXOS is not supposed to require all nodes for CAS we should be able to fail nodes or a certain number of nodes and still continue to CAS (test availability of CAS under failure conditions). No regression test.
CASSANDRA-8641 <https://issues.apache.org/jira/browse/CASSANDRA-8641> | *Unassigned* | Repair causes a large number of tiny SSTables | User says something doesn't work for them? Could we have anticipated that vnodes would not work as formulated for this case.
CASSANDRA-8652 <https://issues.apache.org/jira/browse/CASSANDRA-8652> | Edward Ribeiro | DROP TABLE should also drop BATCH prepared statements associated to it | Not sure if this is an optimization or fixes a user visible issue, but could this have been detected by exercising the functionality better before release.
CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668> | Benedict | We don't enforce offheap memory constraints; regression introduced by 7882 | Memory constraints was a supported feature/UI, but not completely tested before release. Could this have been found most effectively by a unit test or a blackbox test?
CASSANDRA-8675 <https://issues.apache.org/jira/browse/CASSANDRA-8675> | *Unassigned* | COPY TO/FROM broken for newline characters | COPY TO/FROM not tested with representative data
CASSANDRA-8677 <https://issues.apache.org/jira/browse/CASSANDRA-8677> | Ariel Weisberg | rpc_interface and listen_interface generate NPE on startup when specified interface doesn't exist | Missing unit tests checking error messages for DatabaseDescriptor
CASSANDRA-8687 <https://issues.apache.org/jira/browse/CASSANDRA-8687> | Jeremiah Jordan | Keyspace should also check Config.isClientMode | Is there a way to test for missing Config.isClientMode checks?
CASSANDRA-8688 <https://issues.apache.org/jira/browse/CASSANDRA-8688> | Yuki Morishita | Standalone sstableupgrade tool throws exception | Tool not tested before release, no regression test
CASSANDRA-8691 <https://issues.apache.org/jira/browse/CASSANDRA-8691> | *Unassigned* | SSTableReader.getPosition() does not correctly filter out queries that exceed its bounds | Is there a scenario where this is user visible, should we test for that?
CASSANDRA-8694 <https://issues.apache.org/jira/browse/CASSANDRA-8694> | Jeff Jirsa | Repair of empty keyspace hangs rather than ignoring the request | Missing boundary condition test, requesting operation on empty, non-existent, or not applicable entity.
CASSANDRA-8695 <https://issues.apache.org/jira/browse/CASSANDRA-8695> | Chris Lockfort | thrift column definition list sometimes immutable | What user visible activities reproduced this, could we have done that before release?
CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719> | Benedict | Using thrift HSHA with offheap_objects appears to corrupt data | Untested configuration before release, this would be straightforward if we ran with it?
CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726> | Benedict | throw OOM in Memory if we fail to allocate | OOM test Cassandra? Try and validate that it fails cleanly and can be restarted on OOM? Same for disk full.
CASSANDRA-8733 <https://issues.apache.org/jira/browse/CASSANDRA-8733> | Tyler Hobbs | List prepend reverses item order | There was a test so sometimes this just happens.

Thanks,
Ariel

On Apr 2, 2015, at 5:21 PM, Philip Thompson <philip.thomp...@datastax.com> wrote:

To add to this:

*Went well*
Tyler Hobbs has reduced failing dtests on trunk by ~90%. By next month, test results should be at 100% pass.

*Went poorly*
We've failed to make progress on running the full test suite across all contributor branches. By the end of this month, I assume we will at least have limited functionality in this area.

On Wed, Apr 1, 2015 at 3:57 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:

Hi all,

It’s time for the first retrospective. For those not familiar this is the part of the development process where we discuss what is and isn’t working when it comes to making reliable releases. We go over the things that worked, the things that didn’t work, and what changes we are going to make.

This is not a forum for discussing individual bugs (or bugs fixed before release due to successful process) although you can cite one and we can discuss what we could have done differently to catch it.
Even if a bug wasn’t released, if it was caught the wrong way (blind luck) and you think our process wouldn’t have caught it, you can bring that up as well.

I don’t expect this retrospective to be the most productive because we already know we are far behind in several areas (passing utests, dtests, running utests and dtests on each commit, running a larger black box system test) and many issues will circle back around to being addressed by one of those three.

If you’re a developer you can review all the things you have committed (or reviewed) in the past month and ask yourself if each met the criteria of done that we agreed on, including adding tests for existing untested code (usually the thing missed). Better to do it now than after discovering your definition of done was flawed because it released a preventable bug.

For this one retrospective you can reach back further to something already released that you feel passionate about, and if you can point to a utest or dtest that should have caught it and is still missing, we can add that to the list of things to test. That would go under CASSANDRA-9012 (Triage missing test coverage) <https://issues.apache.org/jira/browse/CASSANDRA-9012>.

There is a root JIRA <https://issues.apache.org/jira/browse/CASSANDRA-9042> for making trunk always releasable. A lot falls under CASSANDRA-9007 (Run stress nightly against trunk in a way that validates) <https://issues.apache.org/jira/browse/CASSANDRA-9007>, which is the root for a new kitchen sink style test that validates the entire feature set together in a black box fashion. Philip Thompson has a basic job running, so we are close to (or at) the tipping point where the doneness criteria for every ticket need to include making sure this job covers the thing you added/changed.
If you aren’t going to add the coverage you need to justify (to yourself and your reviewer) breaking it out into something separate and file a JIRA indicating the coverage was missing (if one doesn’t already exist). Make sure to link it to 9007 so we can see what has already been reported.

The reason I say we might not be at the tipping point is that while we have the job, we haven’t ironed out how stress (or something new) will act as a container for validating multiple features, especially in an environment where things like cluster/node failures and topology changes occur.

Retrospectives aren’t supposed to include the preceding paragraphs; we should funnel discussion about them into a separate email thread.

On to the retrospective. This is more for me to solicit information from you than for me to push information to you.

Went well
  Positive response to the definition of done
  Lots of manpower from QA and progress on test infrastructure

Went poorly
  Some wanting to add validation to a kitchen sink style test, but not being able to yet
  Not having a way to know if we are effectively implementing the definition of done without waiting for bugs as feedback

Changes
  Coordinate with Philip Thompson to see how we can get to having developers able to add validation to the kitchen sink style test

Regards,
Ariel