Can we please make this freeze conditional, i.e. we unfreeze automatically
after ptest is clean (as evidenced by a clean HiveQA run on a given JIRA)?

On 18/5/14, 15:16, "Alan Gates" <alanfga...@gmail.com> wrote:

>We should do it in a separate thread so that people can see it with the
>[VOTE] subject.  Some people use that as a filter in their email to know
>when to pay attention to things.
>
>Alan.
>
>On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
>pjayachand...@hortonworks.com> wrote:
>
>> Will there be a separate voting thread? Or is the voting on this thread
>> sufficient for the lock down?
>>
>> Thanks
>> Prasanth
>>
>> > On May 14, 2018, at 2:34 PM, Alan Gates <alanfga...@gmail.com> wrote:
>> >
>> > I see there's support for this, but people are still pouring in commits.
>> > I proposed we have a quick vote on this to lock down the commits until we
>> > get to green.  That way everyone knows we have drawn the line at a specific
>> > point.  Any commits after that point would be reverted.  There isn't a
>> > category in the bylaws that fits this kind of vote but I suggest lazy
>> > majority as the most appropriate one (at least 3 votes, more +1s than -1s).
>> >
>> > Alan.
>> >
>> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <vih...@cloudera.com>
>> > wrote:
>> >>
>> >> I worked on a few quick-fix optimizations in the Ptest infrastructure over
>> >> the weekend, which reduced the execution time from ~90 min to ~70 min per
>> >> run. I had to restart Ptest multiple times; I was resubmitting the patches
>> >> which were in the queue manually, but I may have missed a few. In case you
>> >> have a patch which is pending pre-commit and you don't see it in the queue,
>> >> please submit it manually or let me know if you don't have access to the
>> >> jenkins job. I will continue to work on the sub-tasks in HIVE-19425 and will
>> >> do some maintenance next weekend as well.
>> >>
>> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <jcama...@apache.org>
>> >> wrote:
>> >>
>> >>> Vineet has already been working on disabling those tests that were timing
>> >>> out. I am working on disabling those that have been consistently generating
>> >>> different q files over the last n ptest runs. I am keeping track of all
>> >>> these tests in https://issues.apache.org/jira/browse/HIVE-19509.
>> >>>
>> >>> -Jesús
>> >>>
>> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
>> >>> pjayachand...@hortonworks.com> wrote:
>> >>>
>> >>>    +1 on freezing commits until we get repeated green test runs. We should
>> >>> probably disable (and remember in a JIRA to re-enable them at a later
>> >>> point) tests that are flaky, to get repeated green test runs.
>> >>>
>> >>>    Thanks
>> >>>    Prasanth
>> >>>
>> >>>
>> >>>
>> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fu...@gmail.com> wrote:
>> >>>
>> >>>
>> >>>    +1 to freezing commits until we stabilize
>> >>>
>> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar wrote:
>> >>>
>> >>>> In order to understand the end-to-end precommit flow, I would like to get
>> >>>> access to the PreCommit-HIVE-Build jenkins script. Does anyone know how I
>> >>>> can get that?
>> >>>>
>> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
>> >>>> jcama...@apache.org> wrote:
>> >>>>
>> >>>>> Bq. For the short term green runs, I think we should @Ignore the tests
>> >>>>> which are known to have been failing for many runs. They are not being
>> >>>>> addressed anyway. If people think they are important to run, we should
>> >>>>> fix them and only then re-enable them.
>> >>>>>
>> >>>>> I think that is a good idea, as we would minimize the time that we halt
>> >>>>> development. We can create a JIRA where we list all the tests that were
>> >>>>> failing and that we have disabled to get the clean run. From that moment,
>> >>>>> we will have zero tolerance towards committing with failing tests. And we
>> >>>>> need to pick up those tests that should not be ignored and bring them back
>> >>>>> passing. If there is no disagreement, I can start working on that.
>> >>>>>
>> >>>>> Once I am done, I can try to help with infra tickets too.
>> >>>>>
>> >>>>> -Jesús
>> >>>>>
>> >>>>>
>> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
>> >>>>>
>> >>>>>    +1. I strongly vote for freezing commits and getting our testing
>> >>>>> coverage into an acceptable state. We have been struggling to stabilize
>> >>>>> branch-3 due to test failures, and releasing Hive 3.0 in its current state
>> >>>>> would be unacceptable.
>> >>>>>
>> >>>>>    Currently there are quite a few test suites which are not even running
>> >>>>> and are timing out. We have been committing patches (to both branch-3 and
>> >>>>> master) without test coverage for these tests.
>> >>>>>    We should immediately figure out what's going on before we proceed
>> >>>>> with commits.
>> >>>>>
>> >>>>>    For reference, the following test suites are timing out on master
>> >>>>> (https://issues.apache.org/jira/browse/HIVE-19506):
>> >>>>>
>> >>>>>
>> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
>> >>>>>
>> >>>>>    Vineet
>> >>>>>
>> >>>>>
>> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:
>> >>>>>
>> >>>>>    +1 There are many problems with the test infrastructure and in my
>> >>>>> opinion it has now become the number one bottleneck for the project. I was
>> >>>>> looking at the infrastructure yesterday and I think the current
>> >>>>> infrastructure (even with its own set of problems) is still under-utilized.
>> >>>>> I am planning to increase the number of threads to process the parallel
>> >>>>> test batches, to start with. It needs a restart on the server side. I can
>> >>>>> do it now, if folks are okay with it. Else I can do it over the weekend
>> >>>>> when the queue is small.
>> >>>>>
>> >>>>>    I listed the improvements which I thought would be useful under
>> >>>>> https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am
>> >>>>> not able to devote as much time as I would like to on it. I would
>> >>>>> appreciate it if folks who have some more time can help out.
>> >>>>>
>> >>>>>    I think to start with, https://issues.apache.org/jira/browse/HIVE-19429
>> >>>>> will help a lot. We need to pack more test runs in parallel, and containers
>> >>>>> provide good isolation.
>> >>>>>
>> >>>>>    For the short term green runs, I think we should @Ignore the tests
>> >>>>> which are known to have been failing for many runs. They are not being
>> >>>>> addressed anyway. If people think they are important to run, we should fix
>> >>>>> them and only then re-enable them.
>> >>>>>
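To make the @Ignore suggestion concrete, here is a minimal sketch assuming JUnit 4
(which the Hive unit tests use); the package, class name, and the HIVE-XXXXX id
below are placeholders for illustration, not the actual suites tracked in
HIVE-19509:

    // Minimal sketch: disable a known-failing suite with JUnit 4's @Ignore.
    // Package, class name, and the HIVE-XXXXX id are placeholders.
    package org.apache.hadoop.hive.ql.example;

    import static org.junit.Assert.assertTrue;

    import org.junit.Ignore;
    import org.junit.Test;

    // @Ignore at class level keeps the whole suite out of the ptest batch; the
    // JIRA id in the reason string is what we grep for when re-enabling later.
    @Ignore("Flaky / timing out; tracked for re-enable in HIVE-XXXXX")
    public class TestSomeFlakyFeature {

      @Test
      public void testSomething() {
        assertTrue(true);
      }
    }

Keeping the tracking JIRA in the reason string means the list of disabled suites
stays recoverable with a single grep, which fits the "zero tolerance from that
moment on" plan discussed above.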
>> >>>>>    Also, I feel we need a light-weight test run which we can run locally
>> >>>>> before submitting a patch for the full suite. That way minor issues with
>> >>>>> the patch can be handled locally. Maybe create a profile which runs a
>> >>>>> subset of important tests which are consistent. We can apply some label
>> >>>>> indicating that the pre-checkin local tests ran successfully, and only then
>> >>>>> submit for the full suite.
>> >>>>>
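One possible shape for that light-weight local profile, sketched with stock JUnit 4
categories; the category and class names here are invented for the example, and a
Maven profile or surefire's groups filter would be an equally valid way to wire it
up:

    import org.junit.Test;
    import org.junit.experimental.categories.Categories;
    import org.junit.experimental.categories.Category;
    import org.junit.runner.RunWith;
    import org.junit.runners.Suite;

    public class QuickChecksExample {

      // Marker interface used as the category tag for fast, consistent tests.
      public interface QuickChecks {}

      // A stable, fast test is tagged so the local pre-checkin profile can pick it up.
      public static class SomeStableTest {
        @Category(QuickChecks.class)
        @Test
        public void runsQuickly() {
          // a fast, deterministic assertion would go here
        }
      }

      // Suite a developer could run locally before sending a patch to the full
      // ptest queue; it includes only tests carrying the QuickChecks category.
      @RunWith(Categories.class)
      @Categories.IncludeCategory(QuickChecks.class)
      @Suite.SuiteClasses({SomeStableTest.class})
      public static class PreCheckinQuickSuite {}
    }

With a category class like this on the classpath, the same subset could also be
selected without a hand-maintained suite class by passing the fully qualified
category name to surefire's groups parameter (mvn test -Dgroups=<category class>).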
>> >>>>>    More thoughts are welcome. Thanks for starting this conversation.
>> >>>>>
>> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
>> >>>>>
>> >>>>>    I believe we have reached a state (maybe we did reach it a while ago)
>> >>>>> that is not sustainable anymore, as there are so many tests failing /
>> >>>>> timing out that it is not possible to verify whether a patch is breaking
>> >>>>> some critical parts of the system or not. It also seems to me that due to
>> >>>>> the timeouts (maybe due to infra, maybe not), ptest runs are taking even
>> >>>>> longer than usual, which in turn creates an even longer queue of patches.
>> >>>>>
>> >>>>>    There is an ongoing effort to improve ptest usability
>> >>>>> (https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
>> >>>>> we need to make an effort to stabilize existing tests and bring that
>> >>>>> failure count to zero.
>> >>>>>
>> >>>>>    Hence, I am suggesting *we stop committing any patch before we get a
>> >>>>> green run*. If someone thinks this proposal is too radical, please come up
>> >>>>> with an alternative, because I do not think it is OK to have the ptest runs
>> >>>>> in their current state. Other projects of a certain size (e.g., Hadoop,
>> >>>>> Spark) are always green; we should be able to do the same.
>> >>>>>
>> >>>>>    Finally, once we get to zero failures, I suggest we be less tolerant
>> >>>>> about committing without getting a clean ptest run. If there is a failure,
>> >>>>> we need to fix it or revert the patch that caused it, and then we continue
>> >>>>> developing.
>> >>>>>
>> >>>>>    Please, let's all work together as a community to fix this issue; that
>> >>>>> is the only way to get to zero quickly.
>> >>>>>
>> >>>>>    Thanks,
>> >>>>>    Jesús
>> >>>>>
>> >>>>>    PS. I assume the flaky tests will come into the discussion. Let's see
>> >>>>> first how many of those we have, then we can work to find a fix.
>> >>>
>> >>>    --
>> >>>    Best regards!
>> >>>    Rui Li
