Cool, thanks for all of the replies. Does this summary sound reasonable?

*Problem:* there are a number of failing tests (including flaky ones) that
don't get looked at and aren't necessarily green when a new Beam release is
cut.

*Proposed Solution:*

   - Add all tests to the release validation
   - For all failing tests (including flaky ones), create a JIRA attached to
   the Beam release and add it to the "test-failures" component*
      - If a test is continuously failing:
         - fix it
         - add the fix to the release
         - close out the JIRA
      - If a test is flaky:
         - try to fix it
         - If fixed:
            - add the fix to the release
            - close out the JIRA
         - else:
            - manually test it
            - modify "Fix Version" to the next release
   - The release validation can continue when all JIRAs are closed out (a
   rough automation sketch is at the end of this summary).

*Why this is an improvement:*

   - Ensures that every test is a valid signal (as opposed to disabling
   failing tests)
   - Creates an incentive to automate tests (no longer on the hook to
   manually test)
   - Creates a forcing function to fix flaky tests (once fixed, a test no
   longer needs to be manually tested)
   - Ensures that every failing test gets looked at

*Why this may not be an improvement:*

   - More effort for release validation
   - May slow down release velocity

* For brevity, it might be better to create one JIRA per component
containing a summary of its failing tests.
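
For concreteness, here is a rough sketch (not a worked implementation) of
how the JIRA-filing and release-gating steps could be automated with the
third-party Python "jira" client (pip install jira). The credentials, test
name, and log URL below are placeholders:

# Rough sketch only: files one JIRA per failing test and gates the release
# on all of them being resolved. The auth values and inputs are placeholders.
from jira import JIRA

RELEASE = "2.11.0"  # whichever release is being validated

client = JIRA(server="https://issues.apache.org/jira",
              basic_auth=("my-user", "my-api-token"))  # placeholder auth

def file_test_failure(test_name, log_url):
    """File a test-failures JIRA attached to the release being validated."""
    return client.create_issue(fields={
        "project": {"key": "BEAM"},
        "issuetype": {"name": "Bug"},
        "summary": "Failing test: " + test_name,
        "description": "Seen during " + RELEASE + " release validation.\n"
                       "Logs: " + log_url,
        "components": [{"name": "test-failures"}],
        "fixVersions": [{"name": RELEASE}],
    })

def release_can_continue():
    """True once no unresolved test-failures JIRAs target this release."""
    remaining = client.search_issues(
        'project = BEAM AND component = test-failures '
        'AND resolution = Unresolved AND fixVersion = "%s"' % RELEASE)
    return len(remaining) == 0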


-Sam




On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <[email protected]> wrote:

>
>
> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <[email protected]> wrote:
>
>>
>>
>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <[email protected]> wrote:
>>
>>> For reference, there are currently 34 unresolved JIRA issues under the
>>> test-failures component [1].
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>
>>
>> And there are 19 labeled with flake or sickbay:
>> https://issues.apache.org/jira/issues/?filter=12343195
>>
>>
>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <[email protected]> wrote:
>>>
>>>> This is a good idea. Some suggestions:
>>>> - It would be nicer if we could figure out a process to act on flaky
>>>> tests more frequently than releases.
>>>>
>>>
>> Any ideas? We could just have some cadence and try to establish the
>> practice of having a deflake thread every couple of weeks? How about we add
>> it to release verification as a first step and then continue to discuss?
>>
>
> Sounds great. I do not know JIRA well enough, but I am hoping that a
> solution can come in the form of tooling. If we could configure JIRA with
> SLOs per issue type, we could have customized reports on which issues are
> not getting enough attention and then load-balance among us.
>
>
>>
>>>> - Another improvement in the process would be having actual owners of
>>>> issues rather than auto-assigned component owners. A few folks have 100+
>>>> assigned issues. Unassigning those issues and finding owners who would
>>>> have time to work on identified flaky tests would be helpful.
>>>>
>>>
>> Yikes. Two issues here:
>>
>>  - sounds like Jira component owners aren't really working for us as a
>> first point of contact for triage
>>  - a person shouldn't really have more than 5 Jira issues assigned, or if
>> you get really loose, maybe 20 (I am guilty of having 30 at this moment...)
>>
>> Maybe this is one or two separate threads?
>>
>
> I can fork this to another thread. I think both issues are related because
> component owners are more likely to be in this situation. I agree with the
> assessment that these are two issues.
>
>
>>
>> Kenn
>>
>>
>>>
>>>>
>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>>>
>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>> regularly.
>>>>>
>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>> list, making sure none might be a symptom of something that should block
>>>>> the release. One modification to your proposal is that, after manual
>>>>> verification that it is safe to release, I would move Fix Version to the
>>>>> next release instead of closing, unless the issue really is fixed or
>>>>> otherwise not reproducible.
>>>>>
>>>>> For automation, I wonder if there's something automatic already
>>>>> available somewhere that would:
>>>>>
>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>>> spam
>>>>>  - file bugs to "test-failures" component
>>>>>  - set Fix Version to the "next" release - right now we have 2.7.1
>>>>> (LTS), 2.11.0 (next mainline), and 3.0.0 (dreamy incompatible ideas), so
>>>>> it needs the smarts to choose 2.11.0
>>>>>
>>>>> If not, I think doing this stuff manually is not that bad, assuming we
>>>>> can stay fairly green.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <[email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> There are a number of tests in our system that are either flaky or
>>>>>> permanently red. I am suggesting adding, if not all, then most of the
>>>>>> tests (style, unit, integration, etc.) to the release validation step.
>>>>>> In this way, we will add a regular cadence to ensuring greenness and no
>>>>>> flaky tests in Beam.
>>>>>>
>>>>>> There are a number of ways of implementing this, but what I think
>>>>>> might work best is to set up a process that either manually or
>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>> component tagged with the release number. The release can then continue
>>>>>> when all JIRAs are closed by either fixing the failure or manually
>>>>>> testing to ensure no adverse side effects (in case there are
>>>>>> environmental issues in the testing infrastructure or otherwise).
>>>>>>
>>>>>> Thanks for reading, what do you think?
>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>> unfixed?
>>>>>> - Can the process be automated?
>>>>>> - What am I missing?
>>>>>>
>>>>>> Regards,
>>>>>> Sam
>>>>>>
>>>>>>
>>>
>>> --
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>
