+1 on the proposal. Adding more to it.

A lot of time has been spent on improving the test runtime and reducing the
number of flaky tests.
The following jiras should give an overview of the effort involved:
https://issues.apache.org/jira/browse/HIVE-14547
https://issues.apache.org/jira/browse/HIVE-13503

Committers, please ensure that the reported failures are definitely not
related to the patch before committing it.

I would also propose the following to keep test runs clean, along with some
tips to keep them fast:

1) Revert any patch that causes a failure. It should be the responsibility of
the contributor to make sure the patch does not cause any failures. I am
against creating follow-ups for fixing test failures because they usually get
ignored or get lower priority, which wastes the effort and time of every other
developer who has to analyze the failures while waiting to commit their patch.

2) +1 from reviewers AFTER a clean green run, or once a reviewer is convinced
that the test failures are unrelated.
Maybe we should stop conditional +1s and wait for a clean green run.

3) Avoid slow tests (there is a jira to print out the runtime of newly added
tests). In general, it's good to have many smaller tests as opposed to a
single big test. If a qfile or junit test is big, splitting it up will help in
parallelizing it and avoiding stragglers.

4) Avoid adding tests to MiniMr (slowest of all).

5) Try to keep the test runtime (see surefire-report to get correct runtime 
without initialization time) under a minute.

6) Avoid adding more read-only tables to the init script as this will increase
the initialization time.

7) If the test case does not require an explain plan, avoid it, as most
failures are explain diffs.

8) If the test case requires explain and does not depend on table or
partition stats, explicitly set stats for the table or partition.
Explicitly setting stats avoids expensive stats computation and avoids
flakiness due to stats diffs.

9) Prefer jUnit over qtest.

10) Add an explicit timeout to jUnit tests to avoid tests hanging indefinitely
(surefire times out after 40 mins).
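
As a rough illustration of point 10, here is a minimal plain-Java sketch of
what a per-test timeout buys: a body that hangs is reported as failed after a
bounded wait instead of holding up the whole run. BoundedTest and
finishesWithin are hypothetical names made up for this example, not Hive or
JUnit APIs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Hive or JUnit.
public class BoundedTest {

    /**
     * Runs the body on a worker thread and reports whether it finished
     * within timeoutMillis. Exceptions thrown by the body are propagated.
     */
    public static boolean finishesWithin(Callable<?> body, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> result = executor.submit(body);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The body is still running after the deadline: treat as a failure.
            return false;
        } finally {
            // Interrupts the worker if the body has not finished yet.
            executor.shutdownNow();
        }
    }
}
```

In JUnit 4 itself the bound is declared on the test, e.g.
@Test(timeout = 60000) with the value in milliseconds, so no helper like this
is needed; the sketch only shows the behaviour.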

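To make the straggler argument in point 3 concrete, here is a small sketch. It
assumes tests are handed, in order, to whichever executor frees up first; that
is an assumption for illustration, not a description of how Hive's pre-commit
infrastructure actually schedules tests. StragglerSketch and wallClock are
hypothetical names, and the minute values below are made up.

```java
import java.util.PriorityQueue;

// Hypothetical model for illustration only.
public class StragglerSketch {

    /**
     * Returns the wall-clock time (in minutes) of a run in which each test is
     * given, in order, to the executor that becomes free first.
     */
    public static int wallClock(int[] testMinutes, int workers) {
        // Min-heap of the time at which each executor becomes free.
        PriorityQueue<Integer> busyUntil = new PriorityQueue<>();
        for (int i = 0; i < workers; i++) {
            busyUntil.add(0);
        }
        int finish = 0;
        for (int minutes : testMinutes) {
            int end = busyUntil.poll() + minutes;
            busyUntil.add(end);
            finish = Math.max(finish, end);
        }
        return finish;
    }
}
```

With 4 executors, wallClock(new int[]{8, 1, 1, 1, 1}, 4) gives 8, because the
single 8-minute test is the straggler. Splitting it into four 2-minute tests,
wallClock(new int[]{2, 2, 2, 2, 1, 1, 1, 1}, 4), gives 3.
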
Thoughts?

Thanks
Prasanth

On Oct 13, 2016, at 11:10 PM, Siddharth Seth <ss...@apache.org> wrote:

There's been a lot of work to make the test runs faster, as well as more
reliable via HIVE-14547, HIVE-13503, and several other jiras. Test runtimes
are around the 1 hour mark, and going down. There were a few green
pre-commit runs (after years?). At the same time, there are still some flaky
tests.

We really should try to keep the test runtimes down, as well as the number
of failures - so that the pre-commit runs can provide useful information.

I'm not sure what the current approach is w.r.t. pre-commit runs before a
commit. What I've seen in other projects is that the pre-commit needs to
run, and come back clean (mostly) before a commit goes in. Between what
used to be 5 day wait times, and inconsistent runs - I don't think this is
always followed in Hive.

It'll be useful to start relying on pre-commit test results again. Given
the flaky tests, I'd suggest the following:
1. Pre-commit must be run on a patch before committing (with very few
exceptions)
2. A green test run is ideal
3. In case there are failures - keep track of these as sub-jiras under a
flaky test umbrella jira (Some under HIVE-14547 already) - to be eventually
fixed.
4. Before committing - cite relevant jiras for a flaky test (create and
cite if it doesn't already exist).

This should help us build up a list of flaky tests over various runs, which
will hopefully get fixed at some point.

Thoughts?

Thanks,
Sid
