Could you sort the ticket list by assignee so people just have to look for their name once?
On Thu, Apr 9, 2015 at 12:43 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:
> Hi,
>
> Thanks Philip.
>
> For items in went poorly, is there anything we could change or should have
> done differently to get contributor branches running in CI? I don’t recall
> us setting a fixed goal for when that had to be done.
>
> There are two angles to evaluate this from and decide if there is a
> problem. We had a schedule goal and we didn’t make it. Or we didn’t have a
> schedule goal, but we committed resources and didn’t make progress.
>
> If it's just not done and we are OK with that it wouldn't go under went
> poorly (it's just how it went).
>
> *Retrospecting the retrospective*
>
> The retrospective didn’t play out the way I hoped and I think it may be
> because I didn’t communicate how we are supposed to use it in enough
> detail. For everyone to have nothing to add in the went poorly column means
> that in the period of time covered by the retrospective we shipped no bug
> fixes for bugs that we think should have been caught before release.
>
> We know that isn’t true because in JIRA for 2.1.3 there were 114 resolved
> bugs. Some of those are going to be things that were addressed before being
> released, but a good chunk have to be fixes for issues in previous
> releases. The reason we need to do this through a retrospective is that
> it’s a lot of work for me and I lack perspective on what is going on with
> each individual issue. Having a PHB chase everyone down and find out what
> they are doing is inefficient and doesn’t work because it lacks
> repeatability/scalability and it builds experience for me, but not for the
> team.
>
> *Evaluating 2.1.3 issues:*
> Right now we have what you might call a target rich environment. Many
> features aren’t exercised in a realistic way or evaluated by success
> criteria (performance, space utilization) that users care about. The first
> few retrospectives should be the noisiest as we work through that backlog.
>
> We may be creating regression tests as part of the bug fixes we shipped (I
> sure hope we did), but regression tests are not the same as the kind of
> tests we could have written to avoid shipping the bug in the first place.
> Tests which have the potential to catch more than just the individual bug
> in question.
>
> I’m going to drag into the went poorly category fixes from 2.1.3 and I
> would like to have the involved parties (fundamentally this is the
> responsibility of the assignee) chime in on why the bug was released in the
> first place and what kind of test we could have done before release to
> catch it. I removed a number of issues that were enhancements, bugs fixed
> before release due to successful process, or wontfixed/not reproducible.
>
> Reasons for revisiting are typically a missing regression test, missing
> test that we could run now to detect this class of problem in the future,
> and most importantly (and also hardest to nail down), what could we do in
> the future when doing that kind of work to build effective tests before
> release.
>
> One of the things I pick on for some of these is inadequate testing of
> boundary conditions, inadequate testing with interrelated components
> (always hard to identify and can change over time). Not testing under
> sufficient load or with a representative data model or data set is also an
> issue for some of these.
>
> *Homework:*
> If you are listed as an assignee you need to triage the ticket. Based on
> the experience with that bug are we doing sufficient testing now, and what
> kind of testing could have been done before release to find the issue
> without the benefit of hindsight.
>
> Every issue needs a response even if the response is "no work to be done."
> If there is work to be done it has to find its way into our testing
> strategy (submit a JIRA, or bring it up here).
>
> *Went poorly:*
> *Key* | *Assignee* | *Summary* | *Revisit reason*
> CASSANDRA-7538 <https://issues.apache.org/jira/browse/CASSANDRA-7538> | Sam Tunnicliffe | Truncate of a CF should also delete Paxos CF | Truncate not tested with PAXOS, what else?
> CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704> | Benedict | FileNotFoundException during STREAM-OUT triggers 100% CPU usage | Streaming testing didn't reproduce this before release
> CASSANDRA-7801 <https://issues.apache.org/jira/browse/CASSANDRA-7801> | Sylvain Lebresne | A successful INSERT with CAS does not always store data in the DB after a DELETE | Multiple access paths for data not tested together
> CASSANDRA-7910 <https://issues.apache.org/jira/browse/CASSANDRA-7910> | Tyler Hobbs | wildcard prepared statements are incorrect after a column is added to the table | Alter table not tested concurrently with ?
> CASSANDRA-8018 <https://issues.apache.org/jira/browse/CASSANDRA-8018> | Benjamin Lerer | Cassandra seems to insert twice in custom PerColumnSecondaryIndex | Custom secondary indexes not tested before release?
> CASSANDRA-8028 <https://issues.apache.org/jira/browse/CASSANDRA-8028> | Carl Yeksigian | Unable to compute when histogram overflowed | Histogram output not tested with representative data sets, no regression test
> CASSANDRA-8122 <https://issues.apache.org/jira/browse/CASSANDRA-8122> | Carl Yeksigian | Undeclare throwable exception while executing 'nodetool netstats localhost' | nodetool not tested against cluster throughout lifecycle, no regression test
> CASSANDRA-8211 <https://issues.apache.org/jira/browse/CASSANDRA-8211> | Marcus Eriksson | Overlapping sstables in L1+ | Noted hard to reproduce, but still is there a way we could have, no regression test
> CASSANDRA-8231 <https://issues.apache.org/jira/browse/CASSANDRA-8231> | Benjamin Lerer | Wrong size of cached prepared statements | Expected cache capacity not validated with actual cache capacity, no regression test
> CASSANDRA-8243 <https://issues.apache.org/jira/browse/CASSANDRA-8243> | Björn Hegerfors | DTCS can leave time-overlaps, limiting ability to expire entire SSTables | Performance improving fast path not tested in a representative way
> CASSANDRA-8264 <https://issues.apache.org/jira/browse/CASSANDRA-8264> | Tyler Hobbs | Problems with multicolumn relations and COMPACT STORAGE | How can we catch interactions like compact storage not being covered by the test
> CASSANDRA-8280 <https://issues.apache.org/jira/browse/CASSANDRA-8280> | Sam Tunnicliffe | Cassandra crashing on inserting data over 64K into indexed strings | Added tests are good example, could focusing on testing all access paths and boundary conditions per access path have prevented this
> CASSANDRA-8285 <https://issues.apache.org/jira/browse/CASSANDRA-8285> | Aleksey Yeschenko | Move all hints related tasks to hints private executor | Pierre's reproducer represents something we weren't doing, but that users are. Is that now being tested?
> CASSANDRA-8286 <https://issues.apache.org/jira/browse/CASSANDRA-8286> | Tyler Hobbs | Regression in ORDER BY | There were tests that failed in some versions, but not all? Did this not ship?
> CASSANDRA-8288 <https://issues.apache.org/jira/browse/CASSANDRA-8288> | Tyler Hobbs | cqlsh describe needs to show 'sstable_compression': '' | Roundtrip test for describe schema?
> CASSANDRA-8292 <https://issues.apache.org/jira/browse/CASSANDRA-8292> | Joshua McKenzie | From Pig: org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. | PIG not tested
> CASSANDRA-8302 <https://issues.apache.org/jira/browse/CASSANDRA-8302> | Tyler Hobbs | Filtering for CONTAINS (KEY) on frozen collection clustering columns within a partition does not work | More untested combinations, could we have spotted that there was an interaction and tested it? Or did this not ship?
> CASSANDRA-8316 <https://issues.apache.org/jira/browse/CASSANDRA-8316> | Marcus Eriksson | "Did not get positive replies from all endpoints" error on incremental repair | What were users doing differently, is there a reproducer for this running now?
> CASSANDRA-8320 <https://issues.apache.org/jira/browse/CASSANDRA-8320> | Marcus Eriksson | 2.1.2: NullPointerException in SSTableWriter | What were users doing that caused this, are we doing that?
> CASSANDRA-8332 <https://issues.apache.org/jira/browse/CASSANDRA-8332> | T Jake Luciani | Null pointer after droping keyspace | Add/drop keyspace not tested under load, with server logs checked for errors
> CASSANDRA-8365 <https://issues.apache.org/jira/browse/CASSANDRA-8365> | Benjamin Lerer | CamelCase name is used as index name instead of lowercase | How can we establish UI consistency?
> CASSANDRA-8370 <https://issues.apache.org/jira/browse/CASSANDRA-8370> | Sam Tunnicliffe | cqlsh doesn't handle LIST statements correctly | cqlsh untested functionality, no regression test?
> CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383> | Benedict | Memtable flush may expire records from the commit log that are in a later memtable | No regression test, no follow up ticket. Could/should this have been reproducible as an actual bug?
> CASSANDRA-8386 <https://issues.apache.org/jira/browse/CASSANDRA-8386> | Marcus Eriksson | Make sure we release references to sstables after incremental repair | Is there a higher level test that could have observed this failure?
> CASSANDRA-8401 <https://issues.apache.org/jira/browse/CASSANDRA-8401> | Jonathan Ellis | dropping a CF doesn't remove the latency-sampling task | Another argument for a schema change stress test, maybe tracking for constant memory utilization
> CASSANDRA-8408 <https://issues.apache.org/jira/browse/CASSANDRA-8408> | Tyler Hobbs | limit appears to replace page size under certain conditions | No test that validates that paging returns the expected number of results? Another of the genre of queries we support but don't test all the combinations
> CASSANDRA-8410 <https://issues.apache.org/jira/browse/CASSANDRA-8410> | Tyler Hobbs | Select with many IN values on clustering columns can result in a StackOverflowError | Another missing boundary conditions test, test maximum size IN clause against *
> CASSANDRA-8421 <https://issues.apache.org/jira/browse/CASSANDRA-8421> | Benjamin Lerer | Cassandra 2.1.1 & Cassandra 2.1.2 UDT not returning value for LIST type as UDT | Is there a test that could have found this condition before release?
> CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429> | Benedict | Some keys unreadable during compaction | Running stress in CI would have caught this, and we're going to do that
> CASSANDRA-8432 <https://issues.apache.org/jira/browse/CASSANDRA-8432> | Marcus Eriksson | Standalone Scrubber broken for LCS | Standalone scrubber not tested, no regression test
> CASSANDRA-8448 <https://issues.apache.org/jira/browse/CASSANDRA-8448> | Brandon Williams | "Comparison method violates its general contract" in AbstractEndpointSnitch | This just happens periodically? Was the snitch not tested under load and the log output checked for errors?
> CASSANDRA-8451 <https://issues.apache.org/jira/browse/CASSANDRA-8451> | Tyler Hobbs | NPE when writetime() or ttl() are nested inside function call | Is this testable? Can we check that functions compose correctly or validate that they are inherently composable. No regression test.
> CASSANDRA-8458 <https://issues.apache.org/jira/browse/CASSANDRA-8458> | Marcus Eriksson | Don't give out positions in an sstable beyond its first/last tokens | Streaming not done in realistic scenario with validation of logging
> CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459> | Benedict | "autocompaction" on reads can prevent memtable space reclaimation | What would have reproduced this before release?
> CASSANDRA-8462 <https://issues.apache.org/jira/browse/CASSANDRA-8462> | Aleksey Yeschenko | Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes | Have additional dtest coverage, need to do this in kitchen sink tests
> CASSANDRA-8463 <https://issues.apache.org/jira/browse/CASSANDRA-8463> | Marcus Eriksson | Constant compaction under LCS | What would have reproduced this before release?
> CASSANDRA-8490 <https://issues.apache.org/jira/browse/CASSANDRA-8490> | Tyler Hobbs | DISTINCT queries with LIMITs or paging are incorrect when partitions are deleted | Untested query forms, no regression test
> CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499> | Benedict | Ensure SSTableWriter cleans up properly after failure | Testing error paths? Any way to test things in a loop to detect leaks?
> CASSANDRA-8510 <https://issues.apache.org/jira/browse/CASSANDRA-8510> | Marcus Eriksson | CompactionManager.submitMaximal may leak resources | Not a user visible problem, so difficult to catch in test, but is there a way
> CASSANDRA-8512 <https://issues.apache.org/jira/browse/CASSANDRA-8512> | Tyler Hobbs | cqlsh unusable after encountering schema mismatch | cqlsh not tested with other functionality active
> CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513> | Benedict | SSTableScanner may not acquire reference, but will still release it when closed | This had a user visible component, what test could have caught it before release?
> CASSANDRA-8514 <https://issues.apache.org/jira/browse/CASSANDRA-8514> | Benjamin Lerer | ArrayIndexOutOfBoundsException in nodetool cfhistograms | Not released, but not caught by automated tests either
> CASSANDRA-8525 <https://issues.apache.org/jira/browse/CASSANDRA-8525> | Marcus Eriksson | Bloom Filter truePositive counter not updated on key cache hit | User visible metric not accurate, but only in one config. Possible to guess correct FP ratio and validate while exploring config space?
> CASSANDRA-8532 <https://issues.apache.org/jira/browse/CASSANDRA-8532> | Marcus Eriksson | Fix calculation of expected write size during compaction | Did this manifest as a user visible issue, could we have tested for that?
> CASSANDRA-8537 <https://issues.apache.org/jira/browse/CASSANDRA-8537> | Marcus Eriksson | ConcurrentModificationException while executing 'nodetool cleanup' | Nodetool cleanup not tested before release
> CASSANDRA-8550 <https://issues.apache.org/jira/browse/CASSANDRA-8550> | Tyler Hobbs | Internal pagination in CQL3 index queries creating substantial overhead | Pagination not performance tested with representative data models
> CASSANDRA-8558 <https://issues.apache.org/jira/browse/CASSANDRA-8558> | Sylvain Lebresne | deleted row still can be selected out | Validate that deleted data stays deleted under * conditions (big matrix of interactions here with different configurations, streaming, repair, cleanup, scrub). Deleted data coming back shows up a lot.
> CASSANDRA-8562 <https://issues.apache.org/jira/browse/CASSANDRA-8562> | Marcus Eriksson | Fix checking available disk space before compaction starts | Is there a user visible negative impact, could it have been tested for?
> CASSANDRA-8563 <https://issues.apache.org/jira/browse/CASSANDRA-8563> | Tyler Hobbs | cqlsh broken for some thrift created tables. | Validate mixed CQL thrift interactions? Possibly abstract everything to be done either by CQL or Thrift and then permute? Seems low value, but necessary if both are claimed to be supported.
> CASSANDRA-8577 <https://issues.apache.org/jira/browse/CASSANDRA-8577> | Artem Aliev | Values of set types not loading correctly into Pig | Full set of interactions with PIG not validated
> CASSANDRA-8579 <https://issues.apache.org/jira/browse/CASSANDRA-8579> | Jimmy Mårdell | sstablemetadata can't load org.apache.cassandra.tools.SSTableMetadataViewer | Running C* from source tree not representative of behavior of deployed builds
> CASSANDRA-8580 <https://issues.apache.org/jira/browse/CASSANDRA-8580> | Marcus Eriksson | AssertionErrors after activating unchecked_tombstone_compaction with leveled compaction | How could this have been reproduced before release? No regression test
> CASSANDRA-8588 <https://issues.apache.org/jira/browse/CASSANDRA-8588> | Dave Brosius | Fix DropTypeStatements isusedBy for maps (typo ignored values) | Not released, but was it detected before release by an automated test?
> CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619> | Benedict | using CQLSSTableWriter gives ConcurrentModificationException | What kind of test would have caught this before release?
> CASSANDRA-8623 <https://issues.apache.org/jira/browse/CASSANDRA-8623> | Marcus Eriksson | sstablesplit fails *randomly* with Data component is missing | Feature not tested before release? No regression test
> CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632> | Benedict | cassandra-stress only generating a single unique row | We rely on stress for performance testing, that might mean it needs real testing that demonstrates it generates load that looks like the load it is supposed to be generating.
> CASSANDRA-8635 <https://issues.apache.org/jira/browse/CASSANDRA-8635> | Marcus Eriksson | STCS cold sstable omission does not handle overwrites without reads | If this workload is a challenge for certain kinds of optimizations we should test it if we think it could happen again.
> CASSANDRA-8640 <https://issues.apache.org/jira/browse/CASSANDRA-8640> | Anthony Cozzie | Paxos requires all nodes for CAS | If PAXOS is not supposed to require all nodes for CAS we should be able to fail nodes or a certain number of nodes and still continue to CAS (test availability of CAS under failure conditions). No regression test.
> CASSANDRA-8641 <https://issues.apache.org/jira/browse/CASSANDRA-8641> | *Unassigned* | Repair causes a large number of tiny SSTables | User says something doesn't work for them? Could we have anticipated that vnodes would not work as formulated for this case.
> CASSANDRA-8652 <https://issues.apache.org/jira/browse/CASSANDRA-8652> | Edward Ribeiro | DROP TABLE should also drop BATCH prepared statements associated to it | Not sure if this is an optimization or fixes a user visible issue, but could this have been detected by exercising the functionality better before release.
> CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668> | Benedict | We don't enforce offheap memory constraints; regression introduced by 7882 | Memory constraints was a supported feature/UI, but not completely tested before release. Could this have been found most effectively by a unit test or a blackbox test?
> CASSANDRA-8675 <https://issues.apache.org/jira/browse/CASSANDRA-8675> | *Unassigned* | COPY TO/FROM broken for newline characters | COPY TO/FROM not tested with representative data
> CASSANDRA-8677 <https://issues.apache.org/jira/browse/CASSANDRA-8677> | Ariel Weisberg | rpc_interface and listen_interface generate NPE on startup when specified interface doesn't exist | Missing unit tests checking error messages for DatabaseDescriptor
> CASSANDRA-8687 <https://issues.apache.org/jira/browse/CASSANDRA-8687> | Jeremiah Jordan | Keyspace should also check Config.isClientMode | Is there a way to test for missing Config.isClientMode checks?
> CASSANDRA-8688 <https://issues.apache.org/jira/browse/CASSANDRA-8688> | Yuki Morishita | Standalone sstableupgrade tool throws exception | Tool not tested before release, no regression test
> CASSANDRA-8691 <https://issues.apache.org/jira/browse/CASSANDRA-8691> | *Unassigned* | SSTableReader.getPosition() does not correctly filter out queries that exceed its bounds | Is there a scenario where this is user visible, should we test for that?
> CASSANDRA-8694 <https://issues.apache.org/jira/browse/CASSANDRA-8694> | Jeff Jirsa | Repair of empty keyspace hangs rather than ignoring the request | Missing boundary condition test, requesting operation on empty, non-existent, or not applicable entity.
> CASSANDRA-8695 <https://issues.apache.org/jira/browse/CASSANDRA-8695> | Chris Lockfort | thrift column definition list sometimes immutable | What user visible activities reproduced this, could we have done that before release?
> CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719> | Benedict | Using thrift HSHA with offheap_objects appears to corrupt data | Untested configuration before release, this would be straightforward if we ran with it?
> CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726> | Benedict | throw OOM in Memory if we fail to allocate | OOM test Cassandra? Try and validate that it fails cleanly and can be restarted on OOM? Same for disk full.
> CASSANDRA-8733 <https://issues.apache.org/jira/browse/CASSANDRA-8733> | Tyler Hobbs | List prepend reverses item order | There was a test so sometimes this just happens.
>
> Thanks,
> Ariel
>
> On Apr 2, 2015, at 5:21 PM, Philip Thompson <philip.thomp...@datastax.com> wrote:
>
> To add to this:
>
> *Went well*
> Tyler Hobbs has reduced failing dtests on trunk by ~90%. By next month,
> test results should be at 100% pass.
>
> *Went poorly*
> We've failed to make progress on running the full test suite across all
> contributor branches. By the end of this month, I assume we will at least
> have limited functionality in this area.
>
> On Wed, Apr 1, 2015 at 3:57 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:
>
> Hi all,
>
> It’s time for the first retrospective. For those not familiar this is the
> part of the development process where we discuss what is and isn’t working
> when it comes to making reliable releases. We go over the things that
> worked, the things that didn’t work, and what changes we are going to make.
>
> This is not a forum for discussing individual bugs (or bugs fixed before
> release due to successful process) although you can cite one and we can
> discuss what we could have done differently to catch it.
> Even if a bug wasn’t released, if it was caught the wrong way (blind luck)
> and you think our process wouldn’t have caught it you can bring that up as
> well.
>
> I don’t expect this retrospective to be the most productive because we
> already know we are far behind in several areas (passing utests, dtests,
> running utests and dtests on each commit, running a larger black box
> system test) and many issues will circle back around to being addressed by
> one of those three.
>
> If you’re a developer you can review all the things you have committed (or
> reviewed) in the past month and ask yourself if it met the criteria of done
> that we agreed on including adding tests for existing untested code
> (usually the thing missed). Better to do it now than after discovering your
> definition of done was flawed because it released a preventable bug.
>
> For this one retrospective you can reach back further to something already
> released that you feel passionate about, and if you can point to a utest or
> dtest that should have caught it that is still missing we can add that to
> the list of things to test. That would go under CASSANDRA-9012 (Triage
> missing test coverage)
> <https://issues.apache.org/jira/browse/CASSANDRA-9012>.
>
> There is a root JIRA <https://issues.apache.org/jira/browse/CASSANDRA-9042>
> for making trunk always releasable. A lot falls under CASSANDRA-9007 (Run
> stress nightly against trunk in a way that validates)
> <https://issues.apache.org/jira/browse/CASSANDRA-9007> which is the root
> for a new kitchen sink style test that validates the entire feature set
> together in a black box fashion. Philip Thompson has a basic job running so
> we are close to (or at) the tipping point where the doneness criteria for
> every ticket needs to include making sure this job covers the thing you
> added/changed.
> If you aren’t going to add the coverage you need to justify (to yourself
> and your reviewer) breaking it out into something separate and file a JIRA
> indicating the coverage was missing (if one doesn’t already exist). Make
> sure to link it to 9007 so we can see what has already been reported.
>
> The reason I say we might not be at the tipping point is that while we
> have the job we haven’t ironed out how stress (or something new) will act
> as a container for validating multiple features. Especially in an
> environment where things like cluster/node failures and topology changes
> occur.
>
> Retrospectives aren’t supposed to include the preceding paragraphs; we
> should funnel discussion about them into a separate email thread.
>
> On to the retrospective. This is more for me to solicit information from
> you than for me to push information to you.
>
> Went well
> Positive response to the definition of done
> Lots of manpower from QA and progress on test infrastructure
> Went poorly
> Some wanting to add validation to a kitchen sink style test, but not being
> able to yet
> Not having a way to know if we are effectively implementing the definition
> of done without waiting for bugs as feedback
> Changes
> Coordinate with Philip Thompson to see how we can get to having developers
> able to add validation to the kitchen sink style test
>
> Regards,
> Ariel
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced