Good points. I wasn't tuned in to those nuances of how the jobs are run. I
think we *could* cause a postcommit job to run against exactly that commit
hash instead of origin/master, but I won't advocate for that. My suggestion
of the "find a green commit" approach is a holdover from continuously
shipping services. It isn't terrifically important when you have a release
branch process.

The bit I feel strongly about is that we should not wait on commits to
master except in catastrophic circumstances. I'm happy with
cut-then-verify/triage.

Kenn

On Wed, Jan 16, 2019 at 9:11 AM Scott Wegner <[email protected]> wrote:

> I like the idea of using test greenness to choose a release commit.
> There are a couple of challenges with our current setup:
>
> 1) Post-commits don't run at every commit. The Jenkins jobs are configured
> to run on pushes to master, but at least some jobs are serialized to run
> a single Jenkins job instance at a time, and the next run will be at the
> current HEAD, skipping any pushes it didn't get to. So it may be
> hard/impossible to find a commit which had all post-commit jobs run against
> it.
>
> 2) I don't see an easy way in Jenkins or GitHub to find overall test
> status for a commit across all test jobs which ran. The GitHub history [1]
> seems to only show badges from PR test runs. Perhaps we're missing
> something in our Jenkins job config to publish the status back to GitHub.
> Or, we could import the necessary data into our Community Metrics DB [2]
> and build our own dashboard [3].
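>
> (If we did get statuses published for master commits, picking a green
> commit could then be scripted. A rough, untested sketch using the standard
> GitHub "combined status" endpoint, where GITHUB_TOKEN is a placeholder:)
>
>   import os
>   import requests
>
>   def combined_status(sha):
>       # Combined state across all contexts reported for this commit.
>       url = "https://api.github.com/repos/apache/beam/commits/%s/status" % sha
>       headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}
>       resp = requests.get(url, headers=headers)
>       resp.raise_for_status()
>       data = resp.json()
>       return data["state"], [(s["context"], s["state"]) for s in data["statuses"]]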
>
> So assuming I'm not missing something, +1 to Sam's proposal to
> cut-then-validate since that seems much easier to get started with today.
>
> [1] https://github.com/apache/beam/commits/master
> [2]
> https://github.com/apache/beam/blob/6c2fe17cfdea1be1fdcfb02267894f0d37a671b3/.test-infra/metrics/sync/jenkins/syncjenkins.py#L38
> [3] https://s.apache.org/beam-community-metrics
>
> On Tue, Jan 15, 2019 at 2:47 PM Kenneth Knowles <[email protected]> wrote:
>
>> Since you brought up the entirety of the process, I would suggest moving
>> the release branch cut earlier, like so:
>>
>>  - Decide to release
>>  - Create a new version in JIRA
>>  - Find a recent green commit (according to postcommit)
>>  - Create a release branch from that commit
>>  - Bump the version on master (green PR w/ parent at the green commit)
>>  - Triage release-blocking JIRAs
>>  - ...
>>
>> Notes:
>>
>>  - Choosing the postcommit signal to cut means we already have the signal
>> and we aren't tempted to wait on master
>>  - Cutting before triage starts the stabilization process ASAP and gives a
>> clear signal on the burndown
>>
>> Kenn
>>
>>
>> On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <[email protected]> wrote:
>>
>>> +Boyuan Zhang <[email protected]> who is modifying the rc validation
>>> script
>>>
>>> I'm thinking of a small change to the proposed process, brought to my
>>> attention by Boyuan.
>>>
>>> Instead of running the additional validation tests during the rc
>>> validation, run the tests and the proposed process after the release branch
>>> has been cut. A couple of reasons why:
>>>
>>>    - The additional validation tests (PostCommit and PreCommit) don't
>>>    run against the RC and are instead run against the branch. This is
>>>    confusing considering the other tests in the RC validation step are per 
>>> RC.
>>>    - The additional validation tests are expensive.
>>>
>>> The final release process would look like:
>>>
>>>    - Decide to release
>>>    - Create a new version in JIRA
>>>    - Triage release-blocking issues in JIRA
>>>    - Review release notes in JIRA
>>>    - Create a release branch
>>>    - Verify that a release builds
>>>    - >>> Verify that a release passes its tests <<< (this is where the
>>>    new process would be added)
>>>    - Build/test/fix RCs
>>>    - >>> Fix any issues <<< (all JIRAs created during the new process
>>>    will have to be closed by here)
>>>    - Finalize the release
>>>    - Promote the release
>>>
>>>
>>>
>>>
>>> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> What do you think about crowd-sourcing?
>>>>
>>>> 1. Filter on Fix Version = 2.10.0
>>>> 2. If assigned, ping the ticket and maybe the assignee; unassign if
>>>> unresponsive
>>>> 3. If unassigned, assign it to yourself while thinking about it
>>>> 4. If you can route it a bit closer to someone who might know, great
>>>> 5. If it doesn't look like a blocker (after routing as best you can), set
>>>> Fix Version = 2.11.0
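>>>>
>>>> (For step 1, a JQL filter along these lines should surface the
>>>> candidates; untested, but it is plain JQL:)
>>>>
>>>>   project = BEAM AND resolution = Unresolved AND fixVersion = "2.10.0"
>>>>   ORDER BY assignee ASC, priority DESC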
>>>>
>>>> I think this has enough mutexes that there should be no duplicated work
>>>> if it is followed. And every step is a standard use of Fix Version and
>>>> Assignee, so there's not really any special policy needed.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <[email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Although we should be cautious when enabling this policy. We have a
>>>>> decent backlog of bugs that we need to plumb through.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1, this sounds good to me.
>>>>>>
>>>>>> I believe the next step would be to open a PR to add this to the
>>>>>> release guide:
>>>>>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>>>
>>>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <[email protected]> wrote:
>>>>>>
>>>>>>> Cool, thanks for all of the replies. Does this summary sound
>>>>>>> reasonable?
>>>>>>>
>>>>>>> *Problem:* there are a number of failing (including flaky) tests
>>>>>>> that don't get looked at, and the test suites aren't necessarily green
>>>>>>> when a new Beam release is cut.
>>>>>>>
>>>>>>> *Proposed Solution:*
>>>>>>>
>>>>>>>    - Add all tests to the release validation
>>>>>>>    - For all failing (including flaky) tests, create a JIRA attached
>>>>>>>    to the Beam release and add it to the "test-failures" component*
>>>>>>>    - If a test is continuously failing:
>>>>>>>          - fix it
>>>>>>>          - add the fix to the release
>>>>>>>          - close out the JIRA
>>>>>>>    - If a test is flaky:
>>>>>>>          - try to fix it
>>>>>>>          - if fixed:
>>>>>>>             - add the fix to the release
>>>>>>>             - close out the JIRA
>>>>>>>          - else:
>>>>>>>             - manually test it
>>>>>>>             - move "Fix Version" to the next release
>>>>>>>    - The release validation can continue when all JIRAs are closed
>>>>>>>    out.
>>>>>>>
>>>>>>> *Why this is an improvement:*
>>>>>>>
>>>>>>>    - Ensures that every test is a valid signal (as opposed to
>>>>>>>    disabling failing tests)
>>>>>>>    - Creates an incentive to automate tests (no longer on the hook
>>>>>>>    to manually test)
>>>>>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>>>>>    longer needs to be manually tested)
>>>>>>>    - Ensures that every failing test gets looked at
>>>>>>>
>>>>>>> *Why this may not be an improvement:*
>>>>>>>
>>>>>>>    - More effort for release validation
>>>>>>>    - May slow down release velocity
>>>>>>>
>>>>>>> * for brevity, it might be better to create a JIRA per component
>>>>>>> containing a summary of failing tests
>>>>>>>
>>>>>>>
>>>>>>> -Sam
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> For reference, there are currently 34 unresolved JIRA issues
>>>>>>>>>> under the test-failures component [1].
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And there are 19 labeled with flake or sickbay:
>>>>>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>>>> - It would be nicer if we could figure out a process to act on
>>>>>>>>>>> flaky tests more frequently than at release time.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Any ideas? We could just have some cadence and try to establish
>>>>>>>>> the practice of having a deflake thread every couple of weeks? How 
>>>>>>>>> about we
>>>>>>>>> add it to release verification as a first step and then continue to 
>>>>>>>>> discuss?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>>>>>> solution can come in the form of tooling. If we could configure JIRA
>>>>>>>> with SLOs per issue type, we could have customized reports on which
>>>>>>>> issues are not getting enough attention and then load-balance among
>>>>>>>> us.
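>>>>>>>>
>>>>>>>> (Even without real SLO support in JIRA, a saved filter along these
>>>>>>>> lines could approximate it, e.g. test-failure issues untouched for
>>>>>>>> two weeks; a rough sketch:)
>>>>>>>>
>>>>>>>>   project = BEAM AND component = test-failures AND resolution =
>>>>>>>>   Unresolved AND updated <= -2w ORDER BY updated ASC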
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> - Another improvement in the process would be having actual owners
>>>>>>>>>>> of issues rather than auto-assigned component owners. A few folks
>>>>>>>>>>> have 100+ assigned issues. Unassigning those issues and finding
>>>>>>>>>>> owners who would have time to work on identified flaky tests would
>>>>>>>>>>> be helpful.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Yikes. Two issues here:
>>>>>>>>>
>>>>>>>>>  - sounds like Jira component owners aren't really working for us
>>>>>>>>> as a first point of contact for triage
>>>>>>>>>  - a person shouldn't really have more than 5 Jira issues assigned,
>>>>>>>>> or if you get really loose maybe 20 (I am guilty of having 30 at this
>>>>>>>>> moment...)
>>>>>>>>>
>>>>>>>>> Maybe this is one or two separate threads?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I can fork this to another thread. I think both issues are related
>>>>>>>> because component owners are more likely to be in this situation. I
>>>>>>>> agree with the assessment of the two issues.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I love this idea. It can easily feel like bugs filed for
>>>>>>>>>>>> Jenkins flakes/failures just get lost if there is no process for 
>>>>>>>>>>>> looking
>>>>>>>>>>>> them over regularly.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest that test failures / flakes all get filed with
>>>>>>>>>>>> Fix Version = whatever release is next. Then at release time we 
>>>>>>>>>>>> can triage
>>>>>>>>>>>> the list, making sure none might be a symptom of something that 
>>>>>>>>>>>> should
>>>>>>>>>>>> block the release. One modification to your proposal is that after 
>>>>>>>>>>>> manual
>>>>>>>>>>>> verification that it is safe to release I would move Fix Version 
>>>>>>>>>>>> to the
>>>>>>>>>>>> next release instead of closing, unless the issue really is fixed 
>>>>>>>>>>>> or
>>>>>>>>>>>> otherwise not reproducible.
>>>>>>>>>>>>
>>>>>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>>>>>> available somewhere that would:
>>>>>>>>>>>>
>>>>>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>>>>  - be *very* careful to try to find an existing bug, else it
>>>>>>>>>>>> will be spam
>>>>>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1
>>>>>>>>>>>> (LTS), 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) 
>>>>>>>>>>>> so need
>>>>>>>>>>>> the smarts to choose 2.11.0
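>>>>>>>>>>>>
>>>>>>>>>>>> (For reference, a very rough, untested sketch of the filing step
>>>>>>>>>>>> with the Python "jira" client; the server, credentials and fix
>>>>>>>>>>>> version are placeholders, and the dedup check is the hard part:)
>>>>>>>>>>>>
>>>>>>>>>>>>   from jira import JIRA
>>>>>>>>>>>>
>>>>>>>>>>>>   jira = JIRA(server="https://issues.apache.org/jira",
>>>>>>>>>>>>               basic_auth=("bot-user", "api-token"))
>>>>>>>>>>>>
>>>>>>>>>>>>   def file_test_failure(summary, description):
>>>>>>>>>>>>       # Be *very* careful to find an existing bug, else it is spam.
>>>>>>>>>>>>       existing = jira.search_issues(
>>>>>>>>>>>>           'project = BEAM AND component = test-failures AND '
>>>>>>>>>>>>           'resolution = Unresolved AND summary ~ "%s"' % summary)
>>>>>>>>>>>>       if existing:
>>>>>>>>>>>>           return existing[0]
>>>>>>>>>>>>       return jira.create_issue(fields={
>>>>>>>>>>>>           "project": {"key": "BEAM"},
>>>>>>>>>>>>           "issuetype": {"name": "Bug"},
>>>>>>>>>>>>           "summary": summary,
>>>>>>>>>>>>           "description": description,
>>>>>>>>>>>>           "components": [{"name": "test-failures"}],
>>>>>>>>>>>>           # still needs the smarts to pick the next mainline version
>>>>>>>>>>>>           "fixVersions": [{"name": "2.11.0"}],
>>>>>>>>>>>>       })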
>>>>>>>>>>>>
>>>>>>>>>>>> If not, I think doing this stuff manually is not that bad,
>>>>>>>>>>>> assuming we can stay fairly green.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of tests in our system that are either
>>>>>>>>>>>>> flaky or permanently red. I am suggesting we add most, if not
>>>>>>>>>>>>> all, of the tests (style, unit, integration, etc.) to the release
>>>>>>>>>>>>> validation step. In this way, we will add a regular cadence for
>>>>>>>>>>>>> ensuring greenness and no flaky tests in Beam.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of ways of implementing this, but what I
>>>>>>>>>>>>> think might work the best is to set up a process that either 
>>>>>>>>>>>>> manually or
>>>>>>>>>>>>> automatically creates a JIRA for the failing test and assigns it 
>>>>>>>>>>>>> to a
>>>>>>>>>>>>> component tagged with the release number. The release can then 
>>>>>>>>>>>>> continue
>>>>>>>>>>>>> when all JIRAs are closed by either fixing the failure or 
>>>>>>>>>>>>> manually testing
>>>>>>>>>>>>> to ensure no adverse side effects (this is in case there are 
>>>>>>>>>>>>> environmental
>>>>>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>>>> - Is there another, easier way to ensure that no test failures
>>>>>>>>>>>>> go unfixed?
>>>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>
