+1, this sounds good to me. I believe the next step would be to open a PR to add this to the release guide: https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md

On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <[email protected]> wrote:

> Cool, thanks for all of the replies. Does this summary sound reasonable?
>
> *Problem:* there are a number of failing tests (including flaky) that
> don't get looked at, and aren't necessarily green upon cutting a new Beam
> release.
>
> *Proposed Solution:*
>
>    - Add all tests to the release validation
>    - For all failing tests (including flaky) create a JIRA attached to
>    the Beam release and add to the "test-failures" component*
>       - If a test is continuously failing
>          - fix it
>          - add fix to release
>          - close out JIRA
>       - If a test is flaky
>          - try and fix it
>          - If fixed
>             - add fix to release
>             - close out JIRA
>          - else
>             - manually test it
>             - modify "Fix Version" to next release
>    - The release validation can continue when all JIRAs are closed
>    out.
>
> *Why this is an improvement:*
>
>    - Ensures that every test is a valid signal (as opposed to disabling
>    failing tests)
>    - Creates an incentive to automate tests (no longer on the hook to
>    manually test)
>    - Creates a forcing-function to fix flaky tests (once fixed, no longer
>    needs to be manually tested)
>    - Ensures that every failing test gets looked at
>
> *Why this may not be an improvement:*
>
>    - More effort for release validation
>    - May slow down release velocity
>
> * for brevity, this might be better to create a JIRA per component
> containing a summary of failing tests
>
>
> -Sam
>
>
> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <[email protected]> wrote:
>
>>
>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <[email protected]> wrote:
>>
>>>
>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <[email protected]> wrote:
>>>
>>>> For reference, there are currently 34 unresolved JIRA issues under the
>>>> test-failures component [1].
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>
>>>
>>> And there are 19 labeled with flake or sickbay:
>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>
>>>
>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>> This is a good idea. Some suggestions:
>>>>> - It would be nicer if we can figure out a process to act on flaky tests
>>>>> more frequently than releases.
>>>>>
>>>>
>>> Any ideas? We could just have some cadence and try to establish the
>>> practice of having a deflake thread every couple of weeks? How about we add
>>> it to release verification as a first step and then continue to discuss?
>>>
>>
>> Sounds great. I do not know enough JIRA, but I am hoping that a solution
>> can come in the form of tooling. If we could configure JIRA with SLOs per
>> issue type, we could have customized reports on which issues are not
>> getting enough attention and then do a load balance among us.
>>
>>
>>>
>>>>> - Another improvement in the process would be having actual owners of
>>>>> issues rather than auto-assigned component owners. A few folks have 100+
>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>> have time to work on identified flaky tests would be helpful.
>>>>>
>>>>
>>> Yikes. Two issues here:
>>>
>>>  - sounds like Jira component owners aren't really working for us as a
>>> first point of contact for triage
>>>  - a person shouldn't really have more than 5 Jira assigned, or if you
>>> get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>
>>> Maybe this is one or two separate threads?
>>>
>>
>> I can fork this to another thread. I think both issues are related
>> because component owners are more likely to be in this situation. I agree
>> with the assessment of two issues.
>>
>>
>>>
>>> Kenn
>>>
>>>
>>>>
>>>>>
>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>>>>
>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>> flakes/failures just get lost if there is no process for looking them
>>>>>> over regularly.
>>>>>>
>>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>>> Version = whatever release is next. Then at release time we can triage
>>>>>> the list, making sure none might be a symptom of something that should
>>>>>> block the release. One modification to your proposal is that after manual
>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>> otherwise not reproducible.
>>>>>>
>>>>>> For automation, I wonder if there's something automatic already
>>>>>> available somewhere that would:
>>>>>>
>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>>>> spam
>>>>>>  - file bugs to "test-failures" component
>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>> smarts to choose 2.11.0
>>>>>>
>>>>>> If not, I think doing this stuff manually is not that bad, assuming
>>>>>> we can stay fairly green.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> There are a number of tests in our system that are either flaky or
>>>>>>> permanently red. I am suggesting to add, if not all, then most of the
>>>>>>> tests (style, unit, integration, etc) to the release validation step.
>>>>>>> In this way, we will add a regular cadence to ensuring greenness and
>>>>>>> no flaky tests in Beam.
>>>>>>>
>>>>>>> There are a number of ways of implementing this, but what I think
>>>>>>> might work the best is to set up a process that either manually or
>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>> component tagged with the release number. The release can then continue
>>>>>>> when all JIRAs are closed by either fixing the failure or manually
>>>>>>> testing to ensure no adverse side effects (this is in case there are
>>>>>>> environmental issues in the testing infrastructure or otherwise).
>>>>>>>
>>>>>>> Thanks for reading, what do you think?
>>>>>>>  - Is there another, easier way to ensure that no test failures go
>>>>>>> unfixed?
>>>>>>>  - Can the process be automated?
>>>>>>>  - What am I missing?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sam
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>
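
For illustration only, the triage flow from Sam's Jan 9 summary could be sketched in Python roughly as follows. Everything here is hypothetical: `FailureTicket`, `triage`, `release_can_continue`, the ticket keys, and the version strings are made-up names for the sketch, not Beam or JIRA tooling. It also assumes that a flaky-test ticket moved to the next release no longer blocks the current one, which is one reading of the summary combined with Kenn's "move Fix Version" suggestion.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FAILING = "continuously failing"
    FLAKY = "flaky"


@dataclass
class FailureTicket:
    """Hypothetical stand-in for a JIRA in the "test-failures" component."""
    key: str                         # made-up ticket key, not a real issue
    kind: Kind
    fixed: bool = False              # a fix was added to the release
    manually_verified: bool = False  # manual test showed it is safe to release
    fix_version: str = ""
    closed: bool = False


def triage(ticket, current_release, next_release):
    """Apply the proposed per-ticket steps and return the updated ticket."""
    # Every failing or flaky test starts out attached to the current release.
    ticket.fix_version = current_release
    if ticket.fixed:
        # Fixed (failing or flaky): add fix to release, close out the JIRA.
        ticket.closed = True
    elif ticket.kind is Kind.FLAKY and ticket.manually_verified:
        # Flaky, not fixed, but manually tested: move to the next release.
        ticket.fix_version = next_release
    # Otherwise the ticket stays open against the current release.
    return ticket


def release_can_continue(tickets, current_release):
    """Assumption: only open tickets still on the current release block it."""
    return all(t.closed or t.fix_version != current_release for t in tickets)


if __name__ == "__main__":
    tickets = [
        FailureTicket("EXAMPLE-1", Kind.FAILING, fixed=True),
        FailureTicket("EXAMPLE-2", Kind.FLAKY, manually_verified=True),
        FailureTicket("EXAMPLE-3", Kind.FAILING),
    ]
    for t in tickets:
        triage(t, "current", "next")
    print(release_can_continue(tickets, "current"))
```

In this sketch a continuously failing test that has not been fixed keeps blocking the release, since the summary offers no manual-test path for it.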
