I like the idea of using test greenness to choose a release commit. There are a couple of challenges with our current setup:

1) Post-commits don't run at every commit. The Jenkins jobs are configured to run on pushes to master, but (at least some) jobs are serialized to run a single Jenkins job instance at a time, and the next run will be at the current HEAD, skipping pushes it didn't get to. So it may be hard or impossible to find a commit which had all post-commit jobs run against it.

2) I don't see an easy way in Jenkins or GitHub to find the overall test status for a commit across all test jobs which ran. The GitHub history [1] seems to only show badges from PR test runs. Perhaps we're missing something in our Jenkins job config to publish the status back to GitHub. Or we could import the necessary data into our Community Metrics DB [2] and build our own dashboard [3].

So, assuming I'm not missing something, +1 to Sam's proposal to cut-then-validate, since that seems much easier to get started with today.

[1] https://github.com/apache/beam/commits/master
[2] https://github.com/apache/beam/blob/6c2fe17cfdea1be1fdcfb02267894f0d37a671b3/.test-infra/metrics/sync/jenkins/syncjenkins.py#L38
[3] https://s.apache.org/beam-community-metrics
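
A rough sketch of the per-commit status collection a dashboard like [3] would need, assuming the Jenkins JSON API; the Jenkins URL and job names below are illustrative placeholders, not our actual configuration:

    # Sketch: collect per-commit results across post-commit jobs via the
    # Jenkins JSON API. Jenkins URL and job names are placeholders.
    import collections
    import requests

    JENKINS = "https://builds.apache.org"            # assumed Jenkins instance
    JOBS = ["beam_PostCommit_Java_GradleBuild",      # hypothetical job names
            "beam_PostCommit_Python_Verify"]

    def builds_for(job):
        # Ask only for what we need: build result plus the git SHA it ran at.
        url = (JENKINS + "/job/" + job + "/api/json"
               "?tree=builds[number,result,actions[lastBuiltRevision[SHA1]]]")
        return requests.get(url, timeout=60).json()["builds"]

    def statuses_by_commit():
        by_sha = collections.defaultdict(dict)
        for job in JOBS:
            for build in builds_for(job):
                sha = next((a["lastBuiltRevision"]["SHA1"]
                            for a in build.get("actions", [])
                            if "lastBuiltRevision" in a), None)
                if sha and build.get("result"):
                    by_sha[sha][job] = build["result"]
        return by_sha

    # A commit only counts as green if every job ran against it and passed,
    # which (per challenge 1 above) may rarely happen.
    green = [sha for sha, results in statuses_by_commit().items()
             if len(results) == len(JOBS)
             and all(r == "SUCCESS" for r in results.values())]
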
On Tue, Jan 15, 2019 at 2:47 PM Kenneth Knowles <[email protected]> wrote:

> Since you brought up the entirety of the process, I would suggest moving the release branch cut up, like so:
>
> - Decide to release
> - Create a new version in JIRA
> - Find a recent green commit (according to postcommit)
> - Create a release branch from that commit
> - Bump the version on master (green PR w/ parent at the green commit)
> - Triage release-blocking JIRAs
> - ...
>
> Notes:
>
> - Choosing the postcommit signal to cut means we already have the signal and we aren't tempted to wait on master
> - Cutting before triage starts the stabilization process ASAP and gives a clear signal on the burndown
>
> Kenn
>
> On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <[email protected]> wrote:
>
>> +Boyuan Zhang <[email protected]>, who is modifying the rc validation script
>>
>> I'm thinking of a small change to the proposed process, brought to my attention by Boyuan.
>>
>> Instead of running the additional validation tests during the rc validation, run the tests and the proposed process after the release branch has been cut. A couple of reasons why:
>>
>> - The additional validation tests (PostCommit and PreCommit) don't run against the RC and are instead run against the branch. This is confusing, considering the other tests in the RC validation step are per RC.
>> - The additional validation tests are expensive.
>>
>> The final release process would look like:
>>
>> - Decide to release
>> - Create a new version in JIRA
>> - Triage release-blocking issues in JIRA
>> - Review release notes in JIRA
>> - Create a release branch
>> - Verify that a release builds
>> - >>> Verify that a release passes its tests <<< (this is where the new process would be added)
>> - Build/test/fix RCs
>> - >>> Fix any issues <<< (all JIRAs created during the new process will have to be closed by here)
>> - Finalize the release
>> - Promote the release
>>
>> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> What do you think about crowd-sourcing?
>>>
>>> 1. Fix Version = 2.10.0
>>> 2. If assigned, ping the ticket and maybe the assignee; unassign if unresponsive
>>> 3. If unassigned, assign it to yourself while thinking about it
>>> 4. If you can route it a bit closer to someone who might know, great
>>> 5. If it doesn't look like a blocker (after routing as best you can), Fix Version = 2.11.0
>>>
>>> I think this has enough mutexes that there should be no duplicated work if it is followed. And every step is a standard use of Fix Version and Assignee, so there's no special policy needed.
>>>
>>> Kenn
>>>
>>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> Although we should be cautious when enabling this policy. We have a decent backlog of bugs that we need to plumb through.
>>>>
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <[email protected]> wrote:
>>>>
>>>>> +1, this sounds good to me.
>>>>>
>>>>> I believe the next step would be to open a PR to add this to the release guide: https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>>
>>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <[email protected]> wrote:
>>>>>
>>>>>> Cool, thanks for all of the replies. Does this summary sound reasonable?
>>>>>>
>>>>>> *Problem:* there are a number of failing (including flaky) tests that don't get looked at, and the test suites aren't necessarily green when a new Beam release is cut.
>>>>>>
>>>>>> *Proposed Solution:*
>>>>>>
>>>>>> - Add all tests to the release validation
>>>>>> - For each failing test (including flaky ones), create a JIRA attached to the Beam release and add it to the "test-failures" component*
>>>>>> - If a test is continuously failing
>>>>>>   - fix it
>>>>>>   - add the fix to the release
>>>>>>   - close out the JIRA
>>>>>> - If a test is flaky
>>>>>>   - try to fix it
>>>>>>   - If fixed
>>>>>>     - add the fix to the release
>>>>>>     - close out the JIRA
>>>>>>   - else
>>>>>>     - manually test it
>>>>>>     - modify "Fix Version" to the next release
>>>>>> - The release validation can continue when all JIRAs are closed out.
>>>>>>
>>>>>> *Why this is an improvement:*
>>>>>>
>>>>>> - Ensures that every test is a valid signal (as opposed to disabling failing tests)
>>>>>> - Creates an incentive to automate tests (no longer on the hook to manually test)
>>>>>> - Creates a forcing function to fix flaky tests (once fixed, a test no longer needs to be manually tested)
>>>>>> - Ensures that every failing test gets looked at
>>>>>>
>>>>>> *Why this may not be an improvement:*
>>>>>>
>>>>>> - More effort for release validation
>>>>>> - May slow down release velocity
>>>>>>
>>>>>> * For brevity, it might be better to create a JIRA per component containing a summary of failing tests.
>>>>>>
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> For reference, there are currently 34 unresolved JIRA issues under the test-failures component [1].
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>
>>>>>>>> And there are 19 labeled with flake or sickbay: https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>>
>>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>>> - It would be nicer if we could figure out a process to act on flaky tests more frequently than at releases.
>>>>>>>>>
>>>>>>>> Any ideas? We could just pick some cadence and try to establish the practice of having a deflake thread every couple of weeks? How about we add it to release verification as a first step and then continue to discuss?
>>>>>>>
>>>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a solution can come in the form of tooling. If we could configure JIRA with SLOs per issue type, we could have customized reports on which issues are not getting enough attention and then load-balance among us.
>>>>>>>
>>>>>>>>>> - Another improvement in the process would be having actual owners of issues rather than auto-assigned component owners. A few folks have 100+ assigned issues. Unassigning those issues, and finding owners who would have time to work on identified flaky tests, would be helpful.
>>>>>>>>
>>>>>>>> Yikes. Two issues here:
>>>>>>>>
>>>>>>>> - sounds like Jira component owners aren't really working for us as a first point of contact for triage
>>>>>>>> - a person shouldn't really have more than 5 Jira issues assigned, or if you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>>>
>>>>>>>> Maybe this is one or two separate threads?
>>>>>>>
>>>>>>> I can fork this to another thread. I think both issues are related, because component owners are more likely to be in this situation. I agree with your assessment that these are two issues.
>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins flakes/failures just get lost if there is no process for looking them over regularly.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that test failures / flakes all get filed with Fix Version = whatever release is next. Then at release time we can triage the list, making sure none might be a symptom of something that should block the release. One modification to your proposal: after manual verification that it is safe to release, I would move Fix Version to the next release instead of closing, unless the issue really is fixed or otherwise not reproducible.
>>>>>>>>>>>
>>>>>>>>>>> For automation, I wonder if there's something automatic already available somewhere that would:
>>>>>>>>>>>
>>>>>>>>>>> - mark the Jenkins build as "Keep This Build Forever"
>>>>>>>>>>> - be *very* careful to try to find an existing bug, else it will be spam
>>>>>>>>>>> - file bugs to the "test-failures" component
>>>>>>>>>>> - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0 (next mainline), and 3.0.0 (dreamy incompatible ideas), so it needs the smarts to choose 2.11.0
>>>>>>>>>>>
>>>>>>>>>>> If not, I think doing this stuff manually is not that bad, assuming we can stay fairly green.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
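
A rough sketch of the kind of automation described in the quoted list above, assuming the `jira` Python client and Jenkins' build HTTP endpoints; the credentials, JQL match, and field values are illustrative placeholders rather than an existing tool:

    # Sketch only: auto-file a JIRA for a failing Jenkins build, with a
    # dedup check first so it doesn't spam. All values are placeholders.
    import requests
    from jira import JIRA

    # jira_client = JIRA("https://issues.apache.org/jira",
    #                    basic_auth=("user", "api-token"))

    def file_failure(build_url, job_name, jira_client, fix_version="2.11.0"):
        # 1. Mark the Jenkins build "Keep This Build Forever" so logs survive.
        #    (Depending on Jenkins config this may also need a CSRF crumb.)
        requests.post(build_url + "/toggleLogKeep", auth=("user", "api-token"))

        # 2. Look hard for an existing open bug before filing a new one.
        existing = jira_client.search_issues(
            'project = BEAM AND component = test-failures AND '
            'resolution = Unresolved AND summary ~ "%s"' % job_name)
        if existing:
            jira_client.add_comment(existing[0], "Still failing: " + build_url)
            return existing[0]

        # 3. Otherwise file against test-failures, Fix Version = next mainline.
        return jira_client.create_issue(fields={
            "project": {"key": "BEAM"},
            "issuetype": {"name": "Bug"},
            "summary": job_name + " is failing",
            "description": "Jenkins build: " + build_url,
            "components": [{"name": "test-failures"}],
            "fixVersions": [{"name": fix_version}],
        })

Choosing the "next mainline" fix version automatically (2.11.0 rather than 2.7.1 or 3.0.0) would still need the version-picking smarts mentioned in the list above.
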
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of tests in our system that are either flaky or permanently red. I am suggesting that we add, if not all, then most of the tests (style, unit, integration, etc.) to the release validation step. In this way, we will add a regular cadence to ensuring greenness and no flaky tests in Beam.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of ways of implementing this, but what I think might work best is to set up a process that either manually or automatically creates a JIRA for the failing test and assigns it to a component tagged with the release number. The release can then continue when all JIRAs are closed, by either fixing the failure or manually testing to ensure there are no adverse side effects (this is in case there are environmental issues in the testing infrastructure or otherwise).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>>> - Is there another, easier way to ensure that no test failures go unfixed?
>>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Sam
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>
>>>>> --
>>>>> Got feedback? tinyurl.com/swegner-feedback

--
Got feedback? tinyurl.com/swegner-feedback
