Good points. I wasn't tuned in to those nuances of how the jobs are run. I think we *could* make a postcommit job run against exactly that commit hash instead of origin/master, but I won't advocate for that. My suggestion of the "find a green commit" approach is a holdover from working on continuously shipped services; it isn't terribly important when you have a release branch process.
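For what it's worth, the kind of thing I had in mind is roughly the sketch below -- purely illustrative, and it assumes postcommit results get published as GitHub commit statuses on master, which (per Scott's point below) may not actually be the case today:

    # Illustrative sketch only: walk recent master commits and return the first
    # one whose combined GitHub status is green. Assumes postcommit jobs report
    # commit statuses back to GitHub, which may not hold for us today.
    import requests

    API = "https://api.github.com/repos/apache/beam"

    def find_green_commit(branch="master", window=50):
        commits = requests.get(
            f"{API}/commits", params={"sha": branch, "per_page": window}
        ).json()
        for commit in commits:
            sha = commit["sha"]
            combined = requests.get(f"{API}/commits/{sha}/status").json()
            # "state" aggregates every status context reported for the commit.
            if combined["total_count"] > 0 and combined["state"] == "success":
                return sha
        return None

    print(find_green_commit())

(Unauthenticated GitHub API calls are rate-limited, so anything real would need a token, but you get the idea.)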
The bit I feel strongly about is that we should not wait on commits to master except in catastrophic circumstances. I'm happy with cut-then-verify/triage.

Kenn

On Wed, Jan 16, 2019 at 9:11 AM Scott Wegner <[email protected]> wrote:

> I like the idea of using test greenness to choose a release commit. There's a couple of challenges with our current setup:
>
> 1) Post-commits don't run at every commit. The Jenkins jobs are configured to run on pushes to master, but (at least some jobs) are serialized to run a single Jenkins job instance at a time, and the next run will be at the current HEAD, skipping pushes it didn't get to. So it may be hard/impossible to find a commit which had all Post-Commit jobs run against it.
>
> 2) I don't see an easy way in Jenkins or GitHub to find overall test status for a commit across all test jobs which ran. The GitHub history [1] seems to only show badges from PR test runs. Perhaps we're missing something in our Jenkins job config to publish the status back to GitHub. Or, we could import the necessary data into our Community Metrics DB [2] and build our own dashboard [3].
>
> So assuming I'm not missing something, +1 to Sam's proposal to cut-then-validate since that seems much easier to get started with today.
>
> [1] https://github.com/apache/beam/commits/master
> [2] https://github.com/apache/beam/blob/6c2fe17cfdea1be1fdcfb02267894f0d37a671b3/.test-infra/metrics/sync/jenkins/syncjenkins.py#L38
> [3] https://s.apache.org/beam-community-metrics
>
> On Tue, Jan 15, 2019 at 2:47 PM Kenneth Knowles <[email protected]> wrote:
>
>> Since you brought up the entirety of the process, I would suggest moving the release branch cut up, like so:
>>
>> - Decide to release
>> - Create a new version in JIRA
>> - Find a recent green commit (according to postcommit)
>> - Create a release branch from that commit
>> - Bump the version on master (green PR w/ parent at the green commit)
>> - Triage release-blocking JIRAs
>> - ...
>>
>> Notes:
>>
>> - Choosing postcommit signal to cut means we already have the signal and we aren't tempted to wait on master
>> - Cutting before triage starts the stabilization process ASAP and gives a clear signal on the burndown
>>
>> Kenn
>>
>> On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <[email protected]> wrote:
>>
>>> +Boyuan Zhang <[email protected]> who is modifying the rc validation script
>>>
>>> I'm thinking of a small change to the proposed process, brought to my attention by Boyuan.
>>>
>>> Instead of running the additional validation tests during the rc validation, run the tests and the proposed process after the release branch has been cut. A couple of reasons why:
>>>
>>> - The additional validation tests (PostCommit and PreCommit) don't run against the RC and are instead run against the branch. This is confusing considering the other tests in the RC validation step are per RC.
>>> - The additional validation tests are expensive.
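(Replying inline to Scott's point (2) above: if we did want each postcommit job to publish its result back to GitHub, the call itself is small -- roughly the sketch below. GITHUB_TOKEN is a placeholder credential, and JOB_NAME / BUILD_URL / GIT_COMMIT are the environment variables Jenkins already exposes to a build; nothing like this is wired up for us today, it's just to show the shape.)

    # Sketch only: publish a commit status from a postcommit job back to GitHub.
    # GITHUB_TOKEN is a placeholder secret; JOB_NAME, BUILD_URL and GIT_COMMIT
    # are provided by Jenkins to every build.
    import os
    import requests

    def publish_status(sha, state, context, target_url):
        # state must be one of: "pending", "success", "failure", "error".
        resp = requests.post(
            f"https://api.github.com/repos/apache/beam/statuses/{sha}",
            headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
            json={
                "state": state,
                "context": context,
                "description": f"{context}: {state}",
                "target_url": target_url,
            },
        )
        resp.raise_for_status()

    publish_status(
        sha=os.environ["GIT_COMMIT"],
        state="success",
        context=os.environ["JOB_NAME"],   # e.g. beam_PostCommit_Java
        target_url=os.environ["BUILD_URL"],
    )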
>>> The final release process would look like:
>>>
>>> - Decide to release
>>> - Create a new version in JIRA
>>> - Triage release-blocking issues in JIRA
>>> - Review release notes in JIRA
>>> - Create a release branch
>>> - Verify that a release builds
>>> - Verify that a release passes its tests <<< (this is where the new process would be added)
>>> - Build/test/fix RCs
>>> - Fix any issues <<< (all JIRAs created during the new process will have to be closed by here)
>>> - Finalize the release
>>> - Promote the release
>>>
>>> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> What do you think about crowd-sourcing?
>>>>
>>>> 1. Fix Version = 2.10.0
>>>> 2. If assigned, ping the ticket and maybe the assignee; unassign if unresponsive
>>>> 3. If unassigned, assign it to yourself while thinking about it
>>>> 4. If you can route it a bit closer to someone who might know, great
>>>> 5. If it doesn't look like a blocker (after routing as best you can), Fix Version = 2.11.0
>>>>
>>>> I think this has enough mutexes that there should be no duplicated work if it is followed. And every step is a standard use of Fix Version and Assignee, so there's not really any special policy needed.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <[email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Although we should be cautious when enabling this policy. We have a decent backlog of bugs that we need to plumb through.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <[email protected]> wrote:
>>>>>
>>>>>> +1, this sounds good to me.
>>>>>>
>>>>>> I believe the next step would be to open a PR to add this to the release guide: https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>>>
>>>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <[email protected]> wrote:
>>>>>>
>>>>>>> Cool, thanks for all of the replies. Does this summary sound reasonable?
>>>>>>>
>>>>>>> *Problem:* there are a number of failing tests (including flaky ones) that don't get looked at, and they aren't necessarily green upon cutting a new Beam release.
>>>>>>>
>>>>>>> *Proposed Solution:*
>>>>>>>
>>>>>>> - Add all tests to the release validation
>>>>>>> - For all failing tests (including flaky) create a JIRA attached to the Beam release and add it to the "test-failures" component* (see the sketch below)
>>>>>>> - If a test is continuously failing
>>>>>>>    - fix it
>>>>>>>    - add fix to release
>>>>>>>    - close out JIRA
>>>>>>> - If a test is flaky
>>>>>>>    - try and fix it
>>>>>>>    - If fixed
>>>>>>>       - add fix to release
>>>>>>>       - close out JIRA
>>>>>>>    - else
>>>>>>>       - manually test it
>>>>>>>       - modify "Fix Version" to next release
>>>>>>> - The release validation can continue when all JIRAs are closed out.
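(On the "create a JIRA ... test-failures component" step above: a minimal sketch of what the filing could look like against the ASF JIRA REST API follows. The credentials, summary text, and fix version are placeholders; none of this exists in our tooling yet.)

    # Minimal sketch: file a JIRA for a failing test in the test-failures
    # component, with Fix Version set to the release being validated.
    # JIRA_USER / JIRA_PASSWORD and all field values below are placeholders.
    import os
    import requests

    JIRA_API = "https://issues.apache.org/jira/rest/api/2"

    def file_test_failure(summary, description, fix_version):
        resp = requests.post(
            f"{JIRA_API}/issue",
            auth=(os.environ["JIRA_USER"], os.environ["JIRA_PASSWORD"]),
            json={
                "fields": {
                    "project": {"key": "BEAM"},
                    "issuetype": {"name": "Bug"},
                    "summary": summary,
                    "description": description,
                    "components": [{"name": "test-failures"}],
                    "fixVersions": [{"name": fix_version}],
                    "labels": ["flake"],
                }
            },
        )
        resp.raise_for_status()
        return resp.json()["key"]

    print(file_test_failure(
        summary="beam_PostCommit_Java: SomeIT is flaky",   # hypothetical test
        description="Seen failing in postcommit; link the Jenkins build here.",
        fix_version="2.11.0",
    ))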
>>>>>>> *Why this is an improvement:*
>>>>>>>
>>>>>>> - Ensures that every test is a valid signal (as opposed to disabling failing tests)
>>>>>>> - Creates an incentive to automate tests (no longer on the hook to manually test)
>>>>>>> - Creates a forcing function to fix flaky tests (once fixed, they no longer need to be manually tested)
>>>>>>> - Ensures that every failing test gets looked at
>>>>>>>
>>>>>>> *Why this may not be an improvement:*
>>>>>>>
>>>>>>> - More effort for release validation
>>>>>>> - May slow down release velocity
>>>>>>>
>>>>>>> * for brevity, it might be better to create a JIRA per component containing a summary of failing tests
>>>>>>>
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> For reference, there are currently 34 unresolved JIRA issues under the test-failures component [1].
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>
>>>>>>>>> And there are 19 labeled with flake or sickbay: https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>>>> - It would be nicer if we could figure out a process to act on flaky tests more frequently than releases.
>>>>>>>>>
>>>>>>>>> Any ideas? We could just have some cadence and try to establish the practice of having a deflake thread every couple of weeks? How about we add it to release verification as a first step and then continue to discuss?
>>>>>>>>
>>>>>>>> Sounds great. I do not know enough JIRA, but I am hoping that a solution can come in the form of tooling. If we could configure JIRA with SLOs per issue type, we could have customized reports on which issues are not getting enough attention and then do a load balance among us.
>>>>>>>>
>>>>>>>>>>> - Another improvement in the process would be having actual owners of issues rather than auto-assigned component owners. A few folks have 100+ assigned issues. Unassigning those issues, and finding owners who would have time to work on identified flaky tests, would be helpful.
>>>>>>>>>
>>>>>>>>> Yikes. Two issues here:
>>>>>>>>>
>>>>>>>>> - sounds like Jira component owners aren't really working for us as a first point of contact for triage
>>>>>>>>> - a person shouldn't really have more than 5 Jiras assigned, or if you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>>>>
>>>>>>>>> Maybe this is one or two separate threads?
>>>>>>>>
>>>>>>>> I can fork this to another thread. I think both issues are related because component owners are more likely to be in this situation. I agree with the assessment of the two issues.
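(On Ahmet's tooling/SLO point: even without configuring anything in JIRA itself, a report like the one sketched below would surface test-failure issues that haven't been touched in a while. The JQL mirrors the filter Scott linked above; the 14-day threshold is an arbitrary placeholder, not an agreed SLO.)

    # Sketch of a crude "attention" report over the test-failures component:
    # unresolved issues not updated in the last 14 days (threshold is arbitrary).
    import requests

    JQL = (
        "project = BEAM AND resolution = Unresolved "
        "AND component = test-failures AND updated <= -14d "
        "ORDER BY priority DESC, updated ASC"
    )

    resp = requests.get(
        "https://issues.apache.org/jira/rest/api/2/search",
        params={"jql": JQL, "fields": "summary,assignee,updated", "maxResults": 100},
    )
    for issue in resp.json()["issues"]:
        fields = issue["fields"]
        assignee = (fields.get("assignee") or {}).get("displayName", "unassigned")
        print(f"{issue['key']}  {assignee}  {fields['updated']}  {fields['summary']}")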
>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins flakes/failures just get lost if there is no process for looking them over regularly.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest that test failures / flakes all get filed with Fix Version = whatever release is next. Then at release time we can triage the list, making sure none might be a symptom of something that should block the release. One modification to your proposal is that after manual verification that it is safe to release, I would move Fix Version to the next release instead of closing, unless the issue really is fixed or otherwise not reproducible.
>>>>>>>>>>>>
>>>>>>>>>>>> For automation, I wonder if there's something automatic already available somewhere that would:
>>>>>>>>>>>>
>>>>>>>>>>>> - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>>>> - be *very* careful to try to find an existing bug, else it will be spam
>>>>>>>>>>>> - file bugs to the "test-failures" component
>>>>>>>>>>>> - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas), so it needs the smarts to choose 2.11.0
>>>>>>>>>>>>
>>>>>>>>>>>> If not, I think doing this stuff manually is not that bad, assuming we can stay fairly green.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of tests in our system that are either flaky or permanently red. I am suggesting to add, if not all, then most of the tests (style, unit, integration, etc.) to the release validation step. In this way, we will add a regular cadence to ensuring greenness and no flaky tests in Beam.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of ways of implementing this, but what I think might work best is to set up a process that either manually or automatically creates a JIRA for the failing test and assigns it to a component tagged with the release number. The release can then continue when all JIRAs are closed, by either fixing the failure or manually testing to ensure no adverse side effects (this is in case there are environmental issues in the testing infrastructure or otherwise).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>>>> - Is there another, easier way to ensure that no test failures go unfixed?
>>>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Sam
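(On the first item in Kenn's automation wish list above: marking a Jenkins build "Keep This Build Forever" can be done over the Jenkins remote API, roughly as sketched below. The toggleLogKeep endpoint toggles rather than sets the flag, so the sketch checks keepLog first. JENKINS_USER / JENKINS_TOKEN, the job name, and the build number are placeholders, and depending on server configuration a CSRF crumb may also be required.)

    # Sketch only: mark a Jenkins build "Keep This Build Forever" via the remote
    # API. Credentials, job name, and build number below are placeholders.
    import os
    import requests

    JENKINS = "https://builds.apache.org"

    def keep_build_forever(job, build_number):
        auth = (os.environ["JENKINS_USER"], os.environ["JENKINS_TOKEN"])
        build_url = f"{JENKINS}/job/{job}/{build_number}"
        info = requests.get(f"{build_url}/api/json", auth=auth).json()
        # toggleLogKeep flips the flag, so only call it if it isn't already set.
        if not info.get("keepLog", False):
            requests.post(f"{build_url}/toggleLogKeep", auth=auth).raise_for_status()

    keep_build_forever("beam_PostCommit_Java", 1234)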
