Re: March 2015 QA retrospective

Ariel Weisberg Thu, 09 Apr 2015 10:44:20 -0700

Hi,

Thanks Philip.


For the items under went poorly, is there anything we could change or should have
done differently to get contributor branches running in CI? I don’t recall
us setting a fixed goal for when that had to be done.

There are two angles from which to evaluate this and decide if there is a
problem: either we had a schedule goal and we didn’t make it, or we didn’t
have a schedule goal, but we committed resources and didn’t make progress.

If it's just not done and we are OK with that, it wouldn't go under went
poorly (it's just how it went).

*Retrospecting the retrospective*

The retrospective didn’t play out the way I hoped and I think it may be
because I didn’t communicate how we are supposed to use it in enough
detail. For everyone to have nothing to add in the went poorly column means
that in the period of time covered by the retrospective we shipped no bug
fixes for bugs that we think should have been caught before release.

We know that isn’t true because in JIRA for 2.1.3 there were 114 resolved
bugs. Some of those are going to be things that were addressed before being
released, but a good chunk have to be fixes for issues in previous
releases. The reason we need to do this through a retrospective is that
doing it myself is a lot of work and I lack perspective on what is going on
with each individual issue. Having a PHB chase everyone down and find out
what they are doing is inefficient and doesn’t work because it lacks
repeatability/scalability and it builds experience for me, but not for the
team.

*Evaluating 2.1.3 issues:*
Right now we have what you might call a target rich environment. Many
features aren’t exercised in a realistic way or evaluated by success
criteria (performance, space utilization) that users care about. The first
few retrospectives should be the noisiest as we work through that backlog.

We may be creating regression tests as part of the bug fixes we shipped (I
sure hope we did), but regression tests are not the same as the kind of
tests we could have written to avoid shipping the bug in the first place:
tests which have the potential to catch more than just the individual bug
in question.
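
To make the distinction concrete, here is a minimal sketch in Python.
Everything in it is hypothetical: `fetch_all` is a toy pager written for
this illustration, not Cassandra code. The regression test pins the single
combination from one bug report, while the sweep checks one invariant
across many limit and page size combinations.

```python
import itertools


def fetch_all(rows, limit, page_size):
    """Toy pager (hypothetical): return up to `limit` rows, one page at a time."""
    out = []
    offset = 0
    while len(out) < limit and offset < len(rows):
        page = rows[offset:offset + page_size]
        offset += len(page)
        # Never return more rows than the caller asked for.
        out.extend(page[:limit - len(out)])
    return out


def regression_test():
    # Pins only the one combination that was reported as a bug.
    return fetch_all(list(range(10)), limit=5, page_size=3) == [0, 1, 2, 3, 4]


def sweep_test():
    # Sweeps many combinations against a single invariant:
    # the result is always the first min(limit, n) rows.
    failures = []
    for n, limit, page_size in itertools.product(
            range(0, 8), range(1, 10), range(1, 10)):
        rows = list(range(n))
        if fetch_all(rows, limit, page_size) != rows[:limit]:
            failures.append((n, limit, page_size))
    return failures
```

The sweep costs a few more lines than the pinned case, but it is the kind
of test that can fail for a combination nobody has reported yet.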

I’m going to drag into the went poorly category fixes from 2.1.3 and I
would like to have the involved parties (fundamentally this is the
responsibility of the assignee) chime in on why the bug was released in the
first place and what kind of test we could have done before release to
catch it. I removed a number of issues that were enhancements, bugs fixed
before release due to successful process, or wontfixed/not reproducible.

Reasons for revisiting are typically a missing regression test, a missing
test that we could run now to detect this class of problem in the future,
and most importantly (and also hardest to nail down), what we could do in
the future when doing that kind of work to build effective tests before
release.

Some of the things I pick on for some of these are inadequate testing of
boundary conditions and inadequate testing with interrelated components
(always hard to identify and can change over time). Not testing under
sufficient load or with a representative data model or data set is also an
issue for some of these.
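
As an illustration of what a boundary condition test can look like, here is
a minimal sketch in Python. The validator and the 64K-style limit are
assumptions made up for the example (they are not taken from the Cassandra
source); the point is that the sizes exercised sit at and just around the
edge rather than in the comfortable middle.

```python
# Hypothetical limit, chosen for illustration only.
MAX_INDEXED_VALUE_BYTES = 64 * 1024 - 1


def validate_indexed_value(value: bytes) -> bool:
    """Toy validator (hypothetical): accept values up to the limit."""
    return len(value) <= MAX_INDEXED_VALUE_BYTES


def boundary_sizes(limit):
    # Empty, one byte, just under the limit, at the limit, just over it.
    return [0, 1, limit - 1, limit, limit + 1]


def run_boundary_checks():
    # Map each boundary size to whether the validator accepted it.
    results = {}
    for size in boundary_sizes(MAX_INDEXED_VALUE_BYTES):
        results[size] = validate_indexed_value(b"x" * size)
    return results
```

The same list of sizes can be reused for every access path that accepts the
value, which is where the per-path coverage tends to be missing.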

*Homework:*
If you are listed as an assignee you need to triage the ticket. Based on
the experience with that bug, are we doing sufficient testing now, and what
kind of testing could have been done before release to find the issue
without the benefit of hindsight?

Every issue needs a response even if the response is "no work to be done."
If there is work to be done it has to find its way into our testing
strategy (submit a JIRA, or bring it up here).

*Went poorly:*
CASSANDRA-7538 <https://issues.apache.org/jira/browse/CASSANDRA-7538>
  Assignee: Sam Tunnicliffe
  Summary: Truncate of a CF should also delete Paxos CF
  Revisit reason: Truncate not tested with PAXOS, what else?

CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704>
  Assignee: Benedict
  Summary: FileNotFoundException during STREAM-OUT triggers 100% CPU usage
  Revisit reason: Streaming testing didn't reproduce this before release

CASSANDRA-7801 <https://issues.apache.org/jira/browse/CASSANDRA-7801>
  Assignee: Sylvain Lebresne
  Summary: A successful INSERT with CAS does not always store data in the
    DB after a DELETE
  Revisit reason: Multiple access paths for data not tested together

CASSANDRA-7910 <https://issues.apache.org/jira/browse/CASSANDRA-7910>
  Assignee: Tyler Hobbs
  Summary: wildcard prepared statements are incorrect after a column is
    added to the table
  Revisit reason: Alter table not tested concurrently with ?

CASSANDRA-8018 <https://issues.apache.org/jira/browse/CASSANDRA-8018>
  Assignee: Benjamin Lerer
  Summary: Cassandra seems to insert twice in custom
    PerColumnSecondaryIndex
  Revisit reason: Custom secondary indexes not tested before release?

CASSANDRA-8028 <https://issues.apache.org/jira/browse/CASSANDRA-8028>
  Assignee: Carl Yeksigian
  Summary: Unable to compute when histogram overflowed
  Revisit reason: Histogram output not tested with representative data
    sets, no regression test

CASSANDRA-8122 <https://issues.apache.org/jira/browse/CASSANDRA-8122>
  Assignee: Carl Yeksigian
  Summary: Undeclare throwable exception while executing 'nodetool
    netstats localhost'
  Revisit reason: nodetool not tested against cluster throughout
    lifecycle, no regression test

CASSANDRA-8211 <https://issues.apache.org/jira/browse/CASSANDRA-8211>
  Assignee: Marcus Eriksson
  Summary: Overlapping sstables in L1+
  Revisit reason: Noted hard to reproduce, but still is there a way we
    could have, no regression test

CASSANDRA-8231 <https://issues.apache.org/jira/browse/CASSANDRA-8231>
  Assignee: Benjamin Lerer
  Summary: Wrong size of cached prepared statements
  Revisit reason: Expected cache capacity not validated with actual cache
    capacity, no regression test

CASSANDRA-8243 <https://issues.apache.org/jira/browse/CASSANDRA-8243>
  Assignee: Björn Hegerfors
  Summary: DTCS can leave time-overlaps, limiting ability to expire entire
    SSTables
  Revisit reason: Performance improving fast path not tested in a
    representative way

CASSANDRA-8264 <https://issues.apache.org/jira/browse/CASSANDRA-8264>
  Assignee: Tyler Hobbs
  Summary: Problems with multicolumn relations and COMPACT STORAGE
  Revisit reason: How can we catch interactions like compact storage not
    being covered by the test

CASSANDRA-8280 <https://issues.apache.org/jira/browse/CASSANDRA-8280>
  Assignee: Sam Tunnicliffe
  Summary: Cassandra crashing on inserting data over 64K into indexed
    strings
  Revisit reason: Added tests are good example, could focusing on testing
    all access paths and boundary conditions per access path have
    prevented this

CASSANDRA-8285 <https://issues.apache.org/jira/browse/CASSANDRA-8285>
  Assignee: Aleksey Yeschenko
  Summary: Move all hints related tasks to hints private executor
  Revisit reason: Pierre's reproducer represents something we weren't
    doing, but that users are. Is that now being tested?

CASSANDRA-8286 <https://issues.apache.org/jira/browse/CASSANDRA-8286>
  Assignee: Tyler Hobbs
  Summary: Regression in ORDER BY
  Revisit reason: There were tests that failed in some versions, but not
    all? Did this not ship?

CASSANDRA-8288 <https://issues.apache.org/jira/browse/CASSANDRA-8288>
  Assignee: Tyler Hobbs
  Summary: cqlsh describe needs to show 'sstable_compression': ''
  Revisit reason: Roundtrip test for describe schema?

CASSANDRA-8292 <https://issues.apache.org/jira/browse/CASSANDRA-8292>
  Assignee: Joshua McKenzie
  Summary: From Pig: org.apache.cassandra.exceptions.ConfigurationException:
    Expecting URI in variable: [cassandra.config]. Please prefix the file
    with file:/// for local files or file://<server>/ for remote files.
  Revisit reason: PIG not tested

CASSANDRA-8302 <https://issues.apache.org/jira/browse/CASSANDRA-8302>
  Assignee: Tyler Hobbs
  Summary: Filtering for CONTAINS (KEY) on frozen collection clustering
    columns within a partition does not work
  Revisit reason: More untested combinations, could we have spotted that
    there was an interaction and tested it? Or did this not ship?

CASSANDRA-8316 <https://issues.apache.org/jira/browse/CASSANDRA-8316>
  Assignee: Marcus Eriksson
  Summary: "Did not get positive replies from all endpoints" error on
    incremental repair
  Revisit reason: What were users doing differently, is there a reproducer
    for this running now?

CASSANDRA-8320 <https://issues.apache.org/jira/browse/CASSANDRA-8320>
  Assignee: Marcus Eriksson
  Summary: 2.1.2: NullPointerException in SSTableWriter
  Revisit reason: What were users doing that caused this, are we doing
    that?

CASSANDRA-8332 <https://issues.apache.org/jira/browse/CASSANDRA-8332>
  Assignee: T Jake Luciani
  Summary: Null pointer after droping keyspace
  Revisit reason: Add/drop keyspace not tested under load, with server
    logs checked for errors

CASSANDRA-8365 <https://issues.apache.org/jira/browse/CASSANDRA-8365>
  Assignee: Benjamin Lerer
  Summary: CamelCase name is used as index name instead of lowercase
  Revisit reason: How can we establish UI consistency?

CASSANDRA-8370 <https://issues.apache.org/jira/browse/CASSANDRA-8370>
  Assignee: Sam Tunnicliffe
  Summary: cqlsh doesn't handle LIST statements correctly
  Revisit reason: cqlsh untested functionality, no regression test?

CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383>
  Assignee: Benedict
  Summary: Memtable flush may expire records from the commit log that are
    in a later memtable
  Revisit reason: No regression test, no follow up ticket. Could/should
    this have been reproducible as an actual bug?

CASSANDRA-8386 <https://issues.apache.org/jira/browse/CASSANDRA-8386>
  Assignee: Marcus Eriksson
  Summary: Make sure we release references to sstables after incremental
    repair
  Revisit reason: Is there a higher level test that could have observed
    this failure?

CASSANDRA-8401 <https://issues.apache.org/jira/browse/CASSANDRA-8401>
  Assignee: Jonathan Ellis
  Summary: dropping a CF doesn't remove the latency-sampling task
  Revisit reason: Another argument for a schema change stress test, maybe
    tracking for constant memory utilization

CASSANDRA-8408 <https://issues.apache.org/jira/browse/CASSANDRA-8408>
  Assignee: Tyler Hobbs
  Summary: limit appears to replace page size under certain conditions
  Revisit reason: No test that validates that paging returns the expected
    number of results? Another of the genre of queries we support but
    don't test all the combinations

CASSANDRA-8410 <https://issues.apache.org/jira/browse/CASSANDRA-8410>
  Assignee: Tyler Hobbs
  Summary: Select with many IN values on clustering columns can result in
    a StackOverflowError
  Revisit reason: Another missing boundary conditions test, test maximum
    size in clause against *

CASSANDRA-8421 <https://issues.apache.org/jira/browse/CASSANDRA-8421>
  Assignee: Benjamin Lerer
  Summary: Cassandra 2.1.1 & Cassandra 2.1.2 UDT not returning value for
    LIST type as UDT
  Revisit reason: Is there a test that could have found this condition
    before release?

CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429>
  Assignee: Benedict
  Summary: Some keys unreadable during compaction
  Revisit reason: Running stress in CI would have caught this, and we're
    going to do that

CASSANDRA-8432 <https://issues.apache.org/jira/browse/CASSANDRA-8432>
  Assignee: Marcus Eriksson
  Summary: Standalone Scrubber broken for LCS
  Revisit reason: Standalone scrubber not tested, no regression test

CASSANDRA-8448 <https://issues.apache.org/jira/browse/CASSANDRA-8448>
  Assignee: Brandon Williams
  Summary: "Comparison method violates its general contract" in
    AbstractEndpointSnitch
  Revisit reason: This just happens periodically? Was the snitch not
    tested under load and the log output checked for errors?

CASSANDRA-8451 <https://issues.apache.org/jira/browse/CASSANDRA-8451>
  Assignee: Tyler Hobbs
  Summary: NPE when writetime() or ttl() are nested inside function call
  Revisit reason: Is this testable? Can we check that functions compose
    correctly or validate that they are inherently composable. No
    regression test.

CASSANDRA-8458 <https://issues.apache.org/jira/browse/CASSANDRA-8458>
  Assignee: Marcus Eriksson
  Summary: Don't give out positions in an sstable beyond its first/last
    tokens
  Revisit reason: Streaming not done in realistic scenario with validation
    of logging

CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459>
  Assignee: Benedict
  Summary: "autocompaction" on reads can prevent memtable space
    reclaimation
  Revisit reason: What would have reproduced this before release?

CASSANDRA-8462 <https://issues.apache.org/jira/browse/CASSANDRA-8462>
  Assignee: Aleksey Yeschenko
  Summary: Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes
  Revisit reason: Have additional dtest coverage, need to do this in
    kitchen sink tests

CASSANDRA-8463 <https://issues.apache.org/jira/browse/CASSANDRA-8463>
  Assignee: Marcus Eriksson
  Summary: Constant compaction under LCS
  Revisit reason: What would have reproduced this before release?

CASSANDRA-8490 <https://issues.apache.org/jira/browse/CASSANDRA-8490>
  Assignee: Tyler Hobbs
  Summary: DISTINCT queries with LIMITs or paging are incorrect when
    partitions are deleted
  Revisit reason: Untested query forms, no regression test

CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499>
  Assignee: Benedict
  Summary: Ensure SSTableWriter cleans up properly after failure
  Revisit reason: Testing error paths? Any way to test things in a loop to
    detect leaks?

CASSANDRA-8510 <https://issues.apache.org/jira/browse/CASSANDRA-8510>
  Assignee: Marcus Eriksson
  Summary: CompactionManager.submitMaximal may leak resources
  Revisit reason: Not a user visible problem, so difficult to catch in
    test, but is there a way

CASSANDRA-8512 <https://issues.apache.org/jira/browse/CASSANDRA-8512>
  Assignee: Tyler Hobbs
  Summary: cqlsh unusable after encountering schema mismatch
  Revisit reason: cqlsh not tested with other functionality active

CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513>
  Assignee: Benedict
  Summary: SSTableScanner may not acquire reference, but will still
    release it when closed
  Revisit reason: This had a user visible component, what test could have
    caught it before release?

CASSANDRA-8514 <https://issues.apache.org/jira/browse/CASSANDRA-8514>
  Assignee: Benjamin Lerer
  Summary: ArrayIndexOutOfBoundsException in nodetool cfhistograms
  Revisit reason: Not released, but not caught by automated tests either

CASSANDRA-8525 <https://issues.apache.org/jira/browse/CASSANDRA-8525>
  Assignee: Marcus Eriksson
  Summary: Bloom Filter truePositive counter not updated on key cache hit
  Revisit reason: User visible metric not accurate, but only in one
    config. Possible to guess correct FP ratio and validate while
    exploring config space?

CASSANDRA-8532 <https://issues.apache.org/jira/browse/CASSANDRA-8532>
  Assignee: Marcus Eriksson
  Summary: Fix calculation of expected write size during compaction
  Revisit reason: Did this manifest as a user visible issue, could we have
    tested for that?

CASSANDRA-8537 <https://issues.apache.org/jira/browse/CASSANDRA-8537>
  Assignee: Marcus Eriksson
  Summary: ConcurrentModificationException while executing 'nodetool
    cleanup'
  Revisit reason: Nodetool cleanup not tested before release

CASSANDRA-8550 <https://issues.apache.org/jira/browse/CASSANDRA-8550>
  Assignee: Tyler Hobbs
  Summary: Internal pagination in CQL3 index queries creating substantial
    overhead
  Revisit reason: Pagination not performance tested with representative
    data models

CASSANDRA-8558 <https://issues.apache.org/jira/browse/CASSANDRA-8558>
  Assignee: Sylvain Lebresne
  Summary: deleted row still can be selected out
  Revisit reason: Validate that deleted data stays deleted under *
    conditions (big matrix of interactions here with different
    configurations, streaming, repair, cleanup, scrub). Deleted data
    coming back shows up a lot.

CASSANDRA-8562 <https://issues.apache.org/jira/browse/CASSANDRA-8562>
  Assignee: Marcus Eriksson
  Summary: Fix checking available disk space before compaction starts
  Revisit reason: Is there a user visible negative impact, could it have
    been tested for?

CASSANDRA-8563 <https://issues.apache.org/jira/browse/CASSANDRA-8563>
  Assignee: Tyler Hobbs
  Summary: cqlsh broken for some thrift created tables.
  Revisit reason: Validate mixed CQL thrift interactions? Possibly
    abstract everything to be done either by CQL or Thrift and then
    permute? Seems low value, but necessary if both are claimed to be
    supported.

CASSANDRA-8577 <https://issues.apache.org/jira/browse/CASSANDRA-8577>
  Assignee: Artem Aliev
  Summary: Values of set types not loading correctly into Pig
  Revisit reason: Full set of interactions with PIG not validated

CASSANDRA-8579 <https://issues.apache.org/jira/browse/CASSANDRA-8579>
  Assignee: Jimmy Mårdell
  Summary: sstablemetadata can't load
    org.apache.cassandra.tools.SSTableMetadataViewer
  Revisit reason: Running C* from source tree not representative of
    behavior of deployed builds

CASSANDRA-8580 <https://issues.apache.org/jira/browse/CASSANDRA-8580>
  Assignee: Marcus Eriksson
  Summary: AssertionErrors after activating unchecked_tombstone_compaction
    with leveled compaction
  Revisit reason: How could this have been reproduced before release? No
    regression test

CASSANDRA-8588 <https://issues.apache.org/jira/browse/CASSANDRA-8588>
  Assignee: Dave Brosius
  Summary: Fix DropTypeStatements isusedBy for maps (typo ignored values)
  Revisit reason: Not released, but was it detected before release by an
    automated test?

CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619>
  Assignee: Benedict
  Summary: using CQLSSTableWriter gives ConcurrentModificationException
  Revisit reason: What kind of test would have caught this before release?

CASSANDRA-8623 <https://issues.apache.org/jira/browse/CASSANDRA-8623>
  Assignee: Marcus Eriksson
  Summary: sstablesplit fails *randomly* with Data component is missing
  Revisit reason: Feature not tested before release? No regression test

CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632>
  Assignee: Benedict
  Summary: cassandra-stress only generating a single unique row
  Revisit reason: We rely on stress for performance testing, that might
    mean it needs real testing that demonstrates it generates load that
    looks like the load it is supposed to be generating.

CASSANDRA-8635 <https://issues.apache.org/jira/browse/CASSANDRA-8635>
  Assignee: Marcus Eriksson
  Summary: STCS cold sstable omission does not handle overwrites without
    reads
  Revisit reason: If this workload is a challenge for certain kinds of
    optimizations we should test it if we think it could happen again.

CASSANDRA-8640 <https://issues.apache.org/jira/browse/CASSANDRA-8640>
  Assignee: Anthony Cozzie
  Summary: Paxos requires all nodes for CAS
  Revisit reason: If PAXOS is not supposed to require all nodes for CAS we
    should be able to fail nodes or a certain number of nodes and still
    continue to CAS (test availability of CAS under failure conditions).
    No regression test.

CASSANDRA-8641 <https://issues.apache.org/jira/browse/CASSANDRA-8641>
  Assignee: *Unassigned*
  Summary: Repair causes a large number of tiny SSTables
  Revisit reason: User says something doesn't work for them? Could we have
    anticipated that vnodes would not work as formulated for this case.

CASSANDRA-8652 <https://issues.apache.org/jira/browse/CASSANDRA-8652>
  Assignee: Edward Ribeiro
  Summary: DROP TABLE should also drop BATCH prepared statements
    associated to it
  Revisit reason: Not sure if this is an optimization or fixes a user
    visible issue, but could this have been detected by exercising the
    functionality better before release.

CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668>
  Assignee: Benedict
  Summary: We don't enforce offheap memory constraints; regression
    introduced by 7882
  Revisit reason: Memory constraints was a supported feature/UI, but not
    completely tested before release. Could this have been found most
    effectively by a unit test or a blackbox test?

CASSANDRA-8675 <https://issues.apache.org/jira/browse/CASSANDRA-8675>
  Assignee: *Unassigned*
  Summary: COPY TO/FROM broken for newline characters
  Revisit reason: COPY TO/FROM not tested with representative data

CASSANDRA-8677 <https://issues.apache.org/jira/browse/CASSANDRA-8677>
  Assignee: Ariel Weisberg
  Summary: rpc_interface and listen_interface generate NPE on startup when
    specified interface doesn't exist
  Revisit reason: Missing unit tests checking error messages for
    DatabaseDescriptor

CASSANDRA-8687 <https://issues.apache.org/jira/browse/CASSANDRA-8687>
  Assignee: Jeremiah Jordan
  Summary: Keyspace should also check Config.isClientMode
  Revisit reason: Is there a way to test for missing Config.isClientMode
    checks?

CASSANDRA-8688 <https://issues.apache.org/jira/browse/CASSANDRA-8688>
  Assignee: Yuki Morishita
  Summary: Standalone sstableupgrade tool throws exception
  Revisit reason: Tool not tested before release, no regression test

CASSANDRA-8691 <https://issues.apache.org/jira/browse/CASSANDRA-8691>
  Assignee: *Unassigned*
  Summary: SSTableReader.getPosition() does not correctly filter out
    queries that exceed its bounds
  Revisit reason: Is there a scenario where this is user visible, should
    we test for that?

CASSANDRA-8694 <https://issues.apache.org/jira/browse/CASSANDRA-8694>
  Assignee: Jeff Jirsa
  Summary: Repair of empty keyspace hangs rather than ignoring the request
  Revisit reason: Missing boundary condition test, requesting operation on
    empty, non-existent, or not applicable entity.

CASSANDRA-8695 <https://issues.apache.org/jira/browse/CASSANDRA-8695>
  Assignee: Chris Lockfort
  Summary: thrift column definition list sometimes immutable
  Revisit reason: What user visible activities reproduced this, could we
    have done that before release?

CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719>
  Assignee: Benedict
  Summary: Using thrift HSHA with offheap_objects appears to corrupt data
  Revisit reason: Untested configuration before release, this would be
    straightforward if we ran with it?

CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726>
  Assignee: Benedict
  Summary: throw OOM in Memory if we fail to allocate
  Revisit reason: OOM test Cassandra? Try and validate that it fails
    cleanly and can be restarted on OOM? Same for disk full.

CASSANDRA-8733 <https://issues.apache.org/jira/browse/CASSANDRA-8733>
  Assignee: Tyler Hobbs
  Summary: List prepend reverses item order
  Revisit reason: There was a test so sometimes this just happens.

Thanks,
Ariel

On Apr 2, 2015, at 5:21 PM, Philip Thompson <philip.thomp...@datastax.com>
wrote:

To add to this:


*Went well*
Tyler Hobbs has reduced failing dtests on trunk by ~90%. By next month,
test results should be at 100% pass.

*Went poorly*
We've failed to make progress on running the full test suite across all
contributor branches. By the end of this month, I assume we will at least
have limited functionality in this area.

On Wed, Apr 1, 2015 at 3:57 PM, Ariel Weisberg <ariel.weisb...@datastax.com>
wrote:

Hi all,

It’s time for the first retrospective. For those not familiar, this is the
part of the development process where we discuss what is and isn’t working
when it comes to making reliable releases. We go over the things that
worked, the things that didn’t work, and what changes we are going to make.

This is not a forum for discussing individual bugs (or bugs fixed before
release due to successful process) although you can cite one and we can
discuss what we could have done differently to catch it. Even if a bug
wasn’t released, if it was caught the wrong way (blind luck) and you think
our process wouldn’t have caught it, you can bring that up as well.

I don’t expect this retrospective to be the most productive because we
already know we are far behind in several areas (passing utests and dtests,
running utests and dtests on each commit, running a larger black box
system test) and many issues will circle back around to being addressed by
one of those three.

If you’re a developer you can review all things you have committed (or
reviewed) in the past month and ask yourself if it met the criteria of done
that we agreed on, including adding tests for existing untested code
(usually the thing missed). Better to do it now than after discovering your
definition of done was flawed because it released a preventable bug.

For this one retrospective you can reach back further to something already
released that you feel passionate about, and if you can point to a utest or
dtest that should have caught it that is still missing we can add that to
the list of things to test. That would go under CASSANDRA-9012 (Triage
missing test coverage)
<https://issues.apache.org/jira/browse/CASSANDRA-9012>.

There is a root JIRA <https://issues.apache.org/jira/browse/CASSANDRA-9042>
for making trunk always releasable. A lot falls under CASSANDRA-9007 (Run
stress nightly against trunk in a way that validates)
<https://issues.apache.org/jira/browse/CASSANDRA-9007>, which is the root
for a new kitchen sink style test that validates the entire feature set
together in a black box fashion. Philip Thompson has a basic job running so
we are close to (or at) the tipping point where the doneness criteria for
every ticket needs to include making sure this job covers the thing you
added/changed. If you aren’t going to add the coverage, you need to justify
(to yourself and your reviewer) breaking it out into something separate and
file a JIRA indicating the coverage was missing (if one doesn’t already
exist). Make sure to link it to 9007 so we can see what has already been
reported.

The reason I say we might not be at the tipping point is that while we
have the job we haven’t ironed out how stress (or something new) will act
as a container for validating multiple features, especially in an
environment where things like cluster/node failures and topology changes
occur.

Retrospectives aren’t supposed to include the preceding paragraphs, so we
should funnel discussion about them into a separate email thread.

On to the retrospective. This is more for me to solicit information from
you than for me to push information to you.

*Went well*
- Positive response to the definition of done
- Lots of manpower from QA and progress on test infrastructure

*Went poorly*
- Some wanting to add validation to a kitchen sink style test, but not being
  able to yet
- Not having a way to know if we are effectively implementing the definition
  of done without waiting for bugs as feedback

*Changes*
- Coordinate with Philip Thompson to see how we can get to having developers
  able to add validation to the kitchen sink style test

Regards,
Ariel
