Re: March 2015 QA retrospective

Jonathan Ellis Thu, 09 Apr 2015 11:13:27 -0700

Could you sort the ticket list by assignee so people just have to look for
their name once?


On Thu, Apr 9, 2015 at 12:43 PM, Ariel Weisberg
<ariel.weisb...@datastax.com> wrote:

> Hi,
>
> Thanks Philip.
>
> For items in went poorly, is there anything we could change or should have
> done differently to get contributor branches running in CI? I don’t recall
> us setting a fixed goal for when that had to be done.
>
> There are two angles to evaluate this from and decide if there is a
> problem. We had a schedule goal and we didn’t make it. Or we didn’t have a
> schedule goal, but we committed resources and didn’t make progress.
>
> If it's just not done and we are OK with that it wouldn't go under went
> poorly (it's just how it went).
>
> *Retrospecting the retrospective*
>
> The retrospective didn’t play out the way I hoped and I think it may be
> because I didn’t communicate how we are supposed to use it in enough
> detail. For everyone to have nothing to add in the went poorly column means
> that in the period of time covered by the retrospective we shipped no bug
> fixes for bugs that we think should have been caught before release.
>
> We know that isn’t true because in JIRA for 2.1.3 there were 114 resolved
> bugs. Some of those are going to be things that were addressed before being
> released, but a good chunk have to be fixes for issues in previous
> releases. The reason we need to do this through a retrospective is that
> it’s a lot of work for me and I lack perspective on what is going on with
> each individual issue. Having a PHB chase everyone down and find out what
> they are doing is inefficient and doesn’t work because it lacks
> repeatability/scalability and it builds experience for me, but not for the
> team.
>
> *Evaluating 2.1.3 issues:*
> Right now we have what you might call a target rich environment. Many
> features aren’t exercised in a realistic way or evaluated by success
> criteria (performance, space utilization) that users care about. The first
> few retrospectives should be the noisiest as we work through that backlog.
>
> We may be creating regression tests as part of the bug fixes we shipped (I
> sure hope we did), but regression tests are not the same as the kind of
> tests we could have written to avoid shipping the bug in the first place:
> tests which have the potential to catch more than just the individual bug
> in question.
>
> I’m going to drag into the went poorly category fixes from 2.1.3 and I
> would like to have the involved parties (fundamentally this is the
> responsibility of the assignee) chime in on why the bug was released in the
> first place and what kind of test we could have done before
> release to catch it. I removed a number of issues that were enhancements,
> bugs fixed before release due to successful process, or wontfixed/not
> reproducible.
>
> Reasons for revisiting are typically a missing regression test, missing
> test that we could run now to detect this class of problem in the future,
> and most importantly (and also hardest to nail down), what could we do in
> the future when doing that kind of work to build effective tests before
> release.
>
> Things I pick on for some of these are inadequate testing of boundary
> conditions and inadequate testing with interrelated components (always
> hard to identify and can change over time). Not testing under sufficient
> load or with a representative data model or data set is also an issue for
> some of these.
>
> *Homework:*
> If you are listed as an assignee you need to triage the ticket. Based on
> the experience with that bug, are we doing sufficient testing now, and what
> kind of testing could have been done before release to find the issue
> without the benefit of hindsight?
>
> Every issue needs a response even if the response is "no work to be done."
> If there is work to be done it has to find its way into our testing
> strategy (submit a JIRA, or bring it up here).
>
> *Went poorly:*
> CASSANDRA-7538 <https://issues.apache.org/jira/browse/CASSANDRA-7538>
>   Assignee: Sam Tunnicliffe
>   Summary: Truncate of a CF should also delete Paxos CF
>   Revisit reason: Truncate not tested with PAXOS, what else?
>
> CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704>
>   Assignee: Benedict
>   Summary: FileNotFoundException during STREAM-OUT triggers 100% CPU usage
>   Revisit reason: Streaming testing didn't reproduce this before release
>
> CASSANDRA-7801 <https://issues.apache.org/jira/browse/CASSANDRA-7801>
>   Assignee: Sylvain Lebresne
>   Summary: A successful INSERT with CAS does not always store data in the
>     DB after a DELETE
>   Revisit reason: Multiple access paths for data not tested together
>
> CASSANDRA-7910 <https://issues.apache.org/jira/browse/CASSANDRA-7910>
>   Assignee: Tyler Hobbs
>   Summary: wildcard prepared statements are incorrect after a column is
>     added to the table
>   Revisit reason: Alter table not tested concurrently with ?
>
> CASSANDRA-8018 <https://issues.apache.org/jira/browse/CASSANDRA-8018>
>   Assignee: Benjamin Lerer
>   Summary: Cassandra seems to insert twice in custom
>     PerColumnSecondaryIndex
>   Revisit reason: Custom secondary indexes not tested before release?
>
> CASSANDRA-8028 <https://issues.apache.org/jira/browse/CASSANDRA-8028>
>   Assignee: Carl Yeksigian
>   Summary: Unable to compute when histogram overflowed
>   Revisit reason: Histogram output not tested with representative data
>     sets, no regression test
>
> CASSANDRA-8122 <https://issues.apache.org/jira/browse/CASSANDRA-8122>
>   Assignee: Carl Yeksigian
>   Summary: Undeclare throwable exception while executing 'nodetool
>     netstats localhost'
>   Revisit reason: nodetool not tested against cluster throughout
>     lifecycle, no regression test
>
> CASSANDRA-8211 <https://issues.apache.org/jira/browse/CASSANDRA-8211>
>   Assignee: Marcus Eriksson
>   Summary: Overlapping sstables in L1+
>   Revisit reason: Noted hard to reproduce, but still is there a way we
>     could have, no regression test
>
> CASSANDRA-8231 <https://issues.apache.org/jira/browse/CASSANDRA-8231>
>   Assignee: Benjamin Lerer
>   Summary: Wrong size of cached prepared statements
>   Revisit reason: Expected cache capacity not validated with actual cache
>     capacity, no regression test
>
> CASSANDRA-8243 <https://issues.apache.org/jira/browse/CASSANDRA-8243>
>   Assignee: Björn Hegerfors
>   Summary: DTCS can leave time-overlaps, limiting ability to expire entire
>     SSTables
>   Revisit reason: Performance improving fast path not tested in a
>     representative way
>
> CASSANDRA-8264 <https://issues.apache.org/jira/browse/CASSANDRA-8264>
>   Assignee: Tyler Hobbs
>   Summary: Problems with multicolumn relations and COMPACT STORAGE
>   Revisit reason: How can we catch interactions like compact storage not
>     being covered by the test
>
> CASSANDRA-8280 <https://issues.apache.org/jira/browse/CASSANDRA-8280>
>   Assignee: Sam Tunnicliffe
>   Summary: Cassandra crashing on inserting data over 64K into indexed
>     strings
>   Revisit reason: Added tests are good example, could focusing on testing
>     all access paths and boundary conditions per access path have
>     prevented this
>
> CASSANDRA-8285 <https://issues.apache.org/jira/browse/CASSANDRA-8285>
>   Assignee: Aleksey Yeschenko
>   Summary: Move all hints related tasks to hints private executor
>   Revisit reason: Pierre's reproducer represents something we weren't
>     doing, but that users are. Is that now being tested?
>
> CASSANDRA-8286 <https://issues.apache.org/jira/browse/CASSANDRA-8286>
>   Assignee: Tyler Hobbs
>   Summary: Regression in ORDER BY
>   Revisit reason: There were tests that failed in some versions, but not
>     all? Did this not ship?
>
> CASSANDRA-8288 <https://issues.apache.org/jira/browse/CASSANDRA-8288>
>   Assignee: Tyler Hobbs
>   Summary: cqlsh describe needs to show 'sstable_compression': ''
>   Revisit reason: Roundtrip test for describe schema?
>
> CASSANDRA-8292 <https://issues.apache.org/jira/browse/CASSANDRA-8292>
>   Assignee: Joshua McKenzie
>   Summary: From Pig:
>     org.apache.cassandra.exceptions.ConfigurationException: Expecting URI
>     in variable: [cassandra.config]. Please prefix the file with file:///
>     for local files or file://<server>/ for remote files.
>   Revisit reason: PIG not tested
>
> CASSANDRA-8302 <https://issues.apache.org/jira/browse/CASSANDRA-8302>
>   Assignee: Tyler Hobbs
>   Summary: Filtering for CONTAINS (KEY) on frozen collection clustering
>     columns within a partition does not work
>   Revisit reason: More untested combinations, could we have spotted that
>     there was an interaction and tested it? Or did this not ship?
>
> CASSANDRA-8316 <https://issues.apache.org/jira/browse/CASSANDRA-8316>
>   Assignee: Marcus Eriksson
>   Summary: "Did not get positive replies from all endpoints" error on
>     incremental repair
>   Revisit reason: What were users doing differently, is there a reproducer
>     for this running now?
>
> CASSANDRA-8320 <https://issues.apache.org/jira/browse/CASSANDRA-8320>
>   Assignee: Marcus Eriksson
>   Summary: 2.1.2: NullPointerException in SSTableWriter
>   Revisit reason: What were users doing that caused this, are we doing
>     that?
>
> CASSANDRA-8332 <https://issues.apache.org/jira/browse/CASSANDRA-8332>
>   Assignee: T Jake Luciani
>   Summary: Null pointer after droping keyspace
>   Revisit reason: Add/drop keyspace not tested under load, with server
>     logs checked for errors
>
> CASSANDRA-8365 <https://issues.apache.org/jira/browse/CASSANDRA-8365>
>   Assignee: Benjamin Lerer
>   Summary: CamelCase name is used as index name instead of lowercase
>   Revisit reason: How can we establish UI consistency?
>
> CASSANDRA-8370 <https://issues.apache.org/jira/browse/CASSANDRA-8370>
>   Assignee: Sam Tunnicliffe
>   Summary: cqlsh doesn't handle LIST statements correctly
>   Revisit reason: cqlsh untested functionality, no regression test?
>
> CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383>
>   Assignee: Benedict
>   Summary: Memtable flush may expire records from the commit log that are
>     in a later memtable
>   Revisit reason: No regression test, no follow up ticket. Could/should
>     this have been reproducible as an actual bug?
>
> CASSANDRA-8386 <https://issues.apache.org/jira/browse/CASSANDRA-8386>
>   Assignee: Marcus Eriksson
>   Summary: Make sure we release references to sstables after incremental
>     repair
>   Revisit reason: Is there a higher level test that could have observed
>     this failure?
>
> CASSANDRA-8401 <https://issues.apache.org/jira/browse/CASSANDRA-8401>
>   Assignee: Jonathan Ellis
>   Summary: dropping a CF doesn't remove the latency-sampling task
>   Revisit reason: Another argument for a schema change stress test, maybe
>     tracking for constant memory utilization
>
> CASSANDRA-8408 <https://issues.apache.org/jira/browse/CASSANDRA-8408>
>   Assignee: Tyler Hobbs
>   Summary: limit appears to replace page size under certain conditions
>   Revisit reason: No test that validates that paging returns the expected
>     number of results? Another of the genre of queries we support but
>     don't test all the combinations
>
> CASSANDRA-8410 <https://issues.apache.org/jira/browse/CASSANDRA-8410>
>   Assignee: Tyler Hobbs
>   Summary: Select with many IN values on clustering columns can result in
>     a StackOverflowError
>   Revisit reason: Another missing boundary conditions test, test maximum
>     size in clause against *
>
> CASSANDRA-8421 <https://issues.apache.org/jira/browse/CASSANDRA-8421>
>   Assignee: Benjamin Lerer
>   Summary: Cassandra 2.1.1 & Cassandra 2.1.2 UDT not returning value for
>     LIST type as UDT
>   Revisit reason: Is there a test that could have found this condition
>     before release?
>
> CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429>
>   Assignee: Benedict
>   Summary: Some keys unreadable during compaction
>   Revisit reason: Running stress in CI would have caught this, and we're
>     going to do that
>
> CASSANDRA-8432 <https://issues.apache.org/jira/browse/CASSANDRA-8432>
>   Assignee: Marcus Eriksson
>   Summary: Standalone Scrubber broken for LCS
>   Revisit reason: Standalone scrubber not tested, no regression test
>
> CASSANDRA-8448 <https://issues.apache.org/jira/browse/CASSANDRA-8448>
>   Assignee: Brandon Williams
>   Summary: "Comparison method violates its general contract" in
>     AbstractEndpointSnitch
>   Revisit reason: This just happens periodically? Was the snitch not
>     tested under load and the log output checked for errors?
>
> CASSANDRA-8451 <https://issues.apache.org/jira/browse/CASSANDRA-8451>
>   Assignee: Tyler Hobbs
>   Summary: NPE when writetime() or ttl() are nested inside function call
>   Revisit reason: Is this testable? Can we check that functions compose
>     correctly or validate that they are inherently composable. No
>     regression test.
>
> CASSANDRA-8458 <https://issues.apache.org/jira/browse/CASSANDRA-8458>
>   Assignee: Marcus Eriksson
>   Summary: Don't give out positions in an sstable beyond its first/last
>     tokens
>   Revisit reason: Streaming not done in realistic scenario with validation
>     of logging
>
> CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459>
>   Assignee: Benedict
>   Summary: "autocompaction" on reads can prevent memtable space
>     reclaimation
>   Revisit reason: What would have reproduced this before release?
>
> CASSANDRA-8462 <https://issues.apache.org/jira/browse/CASSANDRA-8462>
>   Assignee: Aleksey Yeschenko
>   Summary: Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes
>   Revisit reason: Have additional dtest coverage, need to do this in
>     kitchen sink tests
>
> CASSANDRA-8463 <https://issues.apache.org/jira/browse/CASSANDRA-8463>
>   Assignee: Marcus Eriksson
>   Summary: Constant compaction under LCS
>   Revisit reason: What would have reproduced this before release?
>
> CASSANDRA-8490 <https://issues.apache.org/jira/browse/CASSANDRA-8490>
>   Assignee: Tyler Hobbs
>   Summary: DISTINCT queries with LIMITs or paging are incorrect when
>     partitions are deleted
>   Revisit reason: Untested query forms, no regression test
>
> CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499>
>   Assignee: Benedict
>   Summary: Ensure SSTableWriter cleans up properly after failure
>   Revisit reason: Testing error paths? Any way to test things in a loop to
>     detect leaks?
>
> CASSANDRA-8510 <https://issues.apache.org/jira/browse/CASSANDRA-8510>
>   Assignee: Marcus Eriksson
>   Summary: CompactionManager.submitMaximal may leak resources
>   Revisit reason: Not a user visible problem, so difficult to catch in
>     test, but is there a way
>
> CASSANDRA-8512 <https://issues.apache.org/jira/browse/CASSANDRA-8512>
>   Assignee: Tyler Hobbs
>   Summary: cqlsh unusable after encountering schema mismatch
>   Revisit reason: cqlsh not tested with other functionality active
>
> CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513>
>   Assignee: Benedict
>   Summary: SSTableScanner may not acquire reference, but will still
>     release it when closed
>   Revisit reason: This had a user visible component, what test could have
>     caught it before release?
>
> CASSANDRA-8514 <https://issues.apache.org/jira/browse/CASSANDRA-8514>
>   Assignee: Benjamin Lerer
>   Summary: ArrayIndexOutOfBoundsException in nodetool cfhistograms
>   Revisit reason: Not released, but not caught by automated tests either
>
> CASSANDRA-8525 <https://issues.apache.org/jira/browse/CASSANDRA-8525>
>   Assignee: Marcus Eriksson
>   Summary: Bloom Filter truePositive counter not updated on key cache hit
>   Revisit reason: User visible metric not accurate, but only in one
>     config. Possible to guess correct FP ratio and validate while
>     exploring config space?
>
> CASSANDRA-8532 <https://issues.apache.org/jira/browse/CASSANDRA-8532>
>   Assignee: Marcus Eriksson
>   Summary: Fix calculation of expected write size during compaction
>   Revisit reason: Did this manifest as a user visible issue, could we have
>     tested for that?
>
> CASSANDRA-8537 <https://issues.apache.org/jira/browse/CASSANDRA-8537>
>   Assignee: Marcus Eriksson
>   Summary: ConcurrentModificationException while executing 'nodetool
>     cleanup'
>   Revisit reason: Nodetool cleanup not tested before release
>
> CASSANDRA-8550 <https://issues.apache.org/jira/browse/CASSANDRA-8550>
>   Assignee: Tyler Hobbs
>   Summary: Internal pagination in CQL3 index queries creating substantial
>     overhead
>   Revisit reason: Pagination not performance tested with representative
>     data models
>
> CASSANDRA-8558 <https://issues.apache.org/jira/browse/CASSANDRA-8558>
>   Assignee: Sylvain Lebresne
>   Summary: deleted row still can be selected out
>   Revisit reason: Validate that deleted data stays deleted under *
>     conditions (big matrix of interactions here with different
>     configurations, streaming, repair, cleanup, scrub). Deleted data
>     coming back shows up a lot.
>
> CASSANDRA-8562 <https://issues.apache.org/jira/browse/CASSANDRA-8562>
>   Assignee: Marcus Eriksson
>   Summary: Fix checking available disk space before compaction starts
>   Revisit reason: Is there a user visible negative impact, could it have
>     been tested for?
>
> CASSANDRA-8563 <https://issues.apache.org/jira/browse/CASSANDRA-8563>
>   Assignee: Tyler Hobbs
>   Summary: cqlsh broken for some thrift created tables.
>   Revisit reason: Validate mixed CQL thrift interactions? Possibly
>     abstract everything to be done either by CQL or Thrift and then
>     permute? Seems low value, but necessary if both are claimed to be
>     supported.
>
> CASSANDRA-8577 <https://issues.apache.org/jira/browse/CASSANDRA-8577>
>   Assignee: Artem Aliev
>   Summary: Values of set types not loading correctly into Pig
>   Revisit reason: Full set of interactions with PIG not validated
>
> CASSANDRA-8579 <https://issues.apache.org/jira/browse/CASSANDRA-8579>
>   Assignee: Jimmy Mårdell
>   Summary: sstablemetadata can't load
>     org.apache.cassandra.tools.SSTableMetadataViewer
>   Revisit reason: Running C* from source tree not representative of
>     behavior of deployed builds
>
> CASSANDRA-8580 <https://issues.apache.org/jira/browse/CASSANDRA-8580>
>   Assignee: Marcus Eriksson
>   Summary: AssertionErrors after activating unchecked_tombstone_compaction
>     with leveled compaction
>   Revisit reason: How could this have been reproduced before release? No
>     regression test
>
> CASSANDRA-8588 <https://issues.apache.org/jira/browse/CASSANDRA-8588>
>   Assignee: Dave Brosius
>   Summary: Fix DropTypeStatements isusedBy for maps (typo ignored values)
>   Revisit reason: Not released, but was it detected before release by an
>     automated test?
>
> CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619>
>   Assignee: Benedict
>   Summary: using CQLSSTableWriter gives ConcurrentModificationException
>   Revisit reason: What kind of test would have caught this before release?
>
> CASSANDRA-8623 <https://issues.apache.org/jira/browse/CASSANDRA-8623>
>   Assignee: Marcus Eriksson
>   Summary: sstablesplit fails *randomly* with Data component is missing
>   Revisit reason: Feature not tested before release? No regression test
>
> CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632>
>   Assignee: Benedict
>   Summary: cassandra-stress only generating a single unique row
>   Revisit reason: We rely on stress for performance testing, that might
>     mean it needs real testing that demonstrates it generates load that
>     looks like the load it is supposed to be generating.
>
> CASSANDRA-8635 <https://issues.apache.org/jira/browse/CASSANDRA-8635>
>   Assignee: Marcus Eriksson
>   Summary: STCS cold sstable omission does not handle overwrites without
>     reads
>   Revisit reason: If this workload is a challenge for certain kinds of
>     optimizations we should test it if we think it could happen again.
>
> CASSANDRA-8640 <https://issues.apache.org/jira/browse/CASSANDRA-8640>
>   Assignee: Anthony Cozzie
>   Summary: Paxos requires all nodes for CAS
>   Revisit reason: If PAXOS is not supposed to require all nodes for CAS we
>     should be able to fail nodes or a certain number of nodes and still
>     continue to CAS (test availability of CAS under failure conditions).
>     No regression test.
>
> CASSANDRA-8641 <https://issues.apache.org/jira/browse/CASSANDRA-8641>
>   Assignee: *Unassigned*
>   Summary: Repair causes a large number of tiny SSTables
>   Revisit reason: User says something doesn't work for them? Could we have
>     anticipated that vnodes would not work as formulated for this case.
>
> CASSANDRA-8652 <https://issues.apache.org/jira/browse/CASSANDRA-8652>
>   Assignee: Edward Ribeiro
>   Summary: DROP TABLE should also drop BATCH prepared statements
>     associated to it
>   Revisit reason: Not sure if this is an optimization or fixes a user
>     visible issue, but could this have been detected by exercising the
>     functionality better before release.
>
> CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668>
>   Assignee: Benedict
>   Summary: We don't enforce offheap memory constraints; regression
>     introduced by 7882
>   Revisit reason: Memory constraints was a supported feature/UI, but not
>     completely tested before release. Could this have been found most
>     effectively by a unit test or a blackbox test?
>
> CASSANDRA-8675 <https://issues.apache.org/jira/browse/CASSANDRA-8675>
>   Assignee: *Unassigned*
>   Summary: COPY TO/FROM broken for newline characters
>   Revisit reason: COPY TO/FROM not tested with representative data
>
> CASSANDRA-8677 <https://issues.apache.org/jira/browse/CASSANDRA-8677>
>   Assignee: Ariel Weisberg
>   Summary: rpc_interface and listen_interface generate NPE on startup when
>     specified interface doesn't exist
>   Revisit reason: Missing unit tests checking error messages for
>     DatabaseDescriptor
>
> CASSANDRA-8687 <https://issues.apache.org/jira/browse/CASSANDRA-8687>
>   Assignee: Jeremiah Jordan
>   Summary: Keyspace should also check Config.isClientMode
>   Revisit reason: Is there a way to test for missing Config.isClientMode
>     checks?
>
> CASSANDRA-8688 <https://issues.apache.org/jira/browse/CASSANDRA-8688>
>   Assignee: Yuki Morishita
>   Summary: Standalone sstableupgrade tool throws exception
>   Revisit reason: Tool not tested before release, no regression test
>
> CASSANDRA-8691 <https://issues.apache.org/jira/browse/CASSANDRA-8691>
>   Assignee: *Unassigned*
>   Summary: SSTableReader.getPosition() does not correctly filter out
>     queries that exceed its bounds
>   Revisit reason: Is there a scenario where this is user visible, should
>     we test for that?
>
> CASSANDRA-8694 <https://issues.apache.org/jira/browse/CASSANDRA-8694>
>   Assignee: Jeff Jirsa
>   Summary: Repair of empty keyspace hangs rather than ignoring the request
>   Revisit reason: Missing boundary condition test, requesting operation on
>     empty, non-existent, or not applicable entity.
>
> CASSANDRA-8695 <https://issues.apache.org/jira/browse/CASSANDRA-8695>
>   Assignee: Chris Lockfort
>   Summary: thrift column definition list sometimes immutable
>   Revisit reason: What user visible activities reproduced this, could we
>     have done that before release?
>
> CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719>
>   Assignee: Benedict
>   Summary: Using thrift HSHA with offheap_objects appears to corrupt data
>   Revisit reason: Untested configuration before release, this would be
>     straightforward if we ran with it?
>
> CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726>
>   Assignee: Benedict
>   Summary: throw OOM in Memory if we fail to allocate
>   Revisit reason: OOM test Cassandra? Try and validate that it fails
>     cleanly and can be restarted on OOM? Same for disk full.
>
> CASSANDRA-8733 <https://issues.apache.org/jira/browse/CASSANDRA-8733>
>   Assignee: Tyler Hobbs
>   Summary: List prepend reverses item order
>   Revisit reason: There was a test so sometimes this just happens.
>
> Thanks,
> Ariel
>
> On Apr 2, 2015, at 5:21 PM, Philip Thompson <philip.thomp...@datastax.com>
> wrote:
>
> To add to this:
>
>
> *Went well*
> Tyler Hobbs has reduced failing dtests on trunk by ~90%. By next month,
> test results should be at 100% pass.
>
> *Went poorly*
> We've failed to make progress on running the full test suite across all
> contributor branches. By the end of this month, I assume we will at least
> have limited functionality in this area.
>
> On Wed, Apr 1, 2015 at 3:57 PM, Ariel Weisberg
> <ariel.weisb...@datastax.com> wrote:
>
> Hi all,
>
> It’s time for the first retrospective. For those not familiar this is the
> part of the development process where we discuss what is and isn’t working
> when it comes to making reliable releases. We go over the things that
> worked, the things that didn’t work, and what changes we are going to make.
>
> This is not a forum for discussing individual bugs (or bugs fixed before
> release due to successful process) although you can cite one and we can
> discuss what we could have done differently to catch it. Even if a bug
> wasn’t released, if it was caught the wrong way (blind luck) and you think
> our process wouldn’t have caught it, you can bring that up as well.
>
> I don’t expect this retrospective to be the most productive because we
> already know we are far behind in several areas (passing utests, dtests,
> running utests and dtests on each commit, running a larger black box
> system test) and many issues will circle back around to being addressed by
> one of those three.
>
> If you’re a developer you can review all things you have committed (or
> reviewed) in the past month and ask yourself if it met the criteria of done
> that we agreed on including adding tests for existing untested code
> (usually the thing missed). Better to do it now than after discovering your
> definition of done was flawed because it released a preventable bug.
>
> For this one retrospective you can reach back further to something already
> released that you feel passionate about, and if you can point to a utest or
> dtest that should have caught it that is still missing we can add that to
> the list of things to test. That would go under CASSANDRA-9012 (Triage
> missing test coverage)
> <https://issues.apache.org/jira/browse/CASSANDRA-9012>.
>
> There is a root JIRA
> <https://issues.apache.org/jira/browse/CASSANDRA-9042> for making trunk
> always releasable. A lot falls under CASSANDRA-9007 (Run stress nightly
> against trunk in a way that validates)
> <https://issues.apache.org/jira/browse/CASSANDRA-9007> which is the root
> for a new kitchen sink style test that validates the entire feature set
> together in a black box fashion. Philip Thompson has a basic job running so
> we are close to (or at) the tipping point where the doneness criteria for
> every ticket needs to include making sure this job covers the thing you
> added/changed. If you aren’t going to add the coverage you need to justify
> (to yourself and your reviewer) breaking it out into something separate and
> file a JIRA indicating the coverage was missing (if one doesn’t already
> exist). Make sure to link it to 9007 so we can see what has already been
> reported.
>
> The reason I say we might not be at the tipping point is that while we
> have the job we haven’t ironed out how stress (or something new) will act
> as a container for validating multiple features, especially in an
> environment where things like cluster/node failures and topology changes
> occur.
>
> Retrospectives aren’t supposed to include the preceding paragraphs, so we
> should funnel discussion about them into a separate email thread.
>
> On to the retrospective. This is more for me to solicit information from
> you than for me to push information to you.
>
> Went well
> Positive response to the definition of done
> Lots of manpower from QA and progress on test infrastructure
> Went poorly
> Some wanting to add validation to a kitchen sink style test, but not being
> able to yet
> Not having a way to know if we are effectively implementing the definition
> of done without waiting for bugs as feedback
> Changes
> Coordinate with Philip Thompson to see how we can get to having developers
> able to add validation to the kitchen sink style test
>
> Regards,
> Ariel
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
