I see there's support for this, but people are still pouring in commits. I
propose we have a quick vote on this to lock down commits until we get to
green. That way everyone knows we have drawn the line at a specific point;
any commits after that point would be reverted. There isn't a category in
the bylaws that fits this kind of vote, but I suggest lazy majority as the
most appropriate one (at least 3 votes, more +1s than -1s).
Alan.

On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:

> I worked on a few quick-fix optimizations in the Ptest infrastructure
> over the weekend which reduced the execution time from ~90 min to ~70 min
> per run. I had to restart Ptest multiple times. I was resubmitting the
> patches which were in the queue manually, but I may have missed a few. In
> case you have a patch which is pending pre-commit and you don't see it in
> the queue, please submit it manually, or let me know if you don't have
> access to the Jenkins job. I will continue to work on the sub-tasks in
> HIVE-19425 and will do some maintenance next weekend as well.
>
> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
>
> > Vineet has already been working on disabling those tests that were
> > timing out. I am working on disabling those that have been generating
> > different q files consistently for the last n ptest runs. I am keeping
> > track of all these tests in
> > https://issues.apache.org/jira/browse/HIVE-19509.
> >
> > -Jesús
> >
> > On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:
> >
> > +1 on freezing commits until we get repeated green test runs. We should
> > probably disable the tests that are flaky (and track them in a JIRA so
> > we remember to re-enable them at a later point) to get repeated green
> > test runs.
> >
> > Thanks
> > Prasanth
> >
> > On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fu...@gmail.com> wrote:
> >
> > +1 to freezing commits until we stabilize
> >
> > On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar wrote:
> >
> > > In order to understand the end-to-end pre-commit flow I would like to
> > > get access to the PreCommit-HIVE-Build Jenkins script. Does anyone
> > > know how I can get that?
> > >
> > > On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> > >
> > > > Bq. For the short-term green runs, I think we should @Ignore the
> > > > tests which have been failing for many runs. They aren't being
> > > > addressed anyway. If people think they are important to be run, we
> > > > should fix them and only then re-enable them.
> > > >
> > > > I think that is a good idea, as we would minimize the time that we
> > > > halt development. We can create a JIRA listing all the tests that
> > > > were failing and that we have disabled to get the clean run. From
> > > > that moment on, we will have zero tolerance towards committing with
> > > > failing tests. And we need to pick out those tests that should not
> > > > be ignored and bring them back, but passing. If there is no
> > > > disagreement, I can start working on that.
> > > >
> > > > Once I am done, I can try to help with the infra tickets too.
> > > >
> > > > -Jesús
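To make the @Ignore approach above concrete, a disabled flaky test could
look roughly like the sketch below. The class name, test name, and JIRA
number are placeholders, not a real patch; the only real pieces are the
JUnit 4 annotations:

    import org.junit.Ignore;
    import org.junit.Test;

    public class TestFlakyExample {

      // Disabled to get back to a green run; tracked in the umbrella JIRA
      // so it gets re-enabled once fixed. HIVE-XXXXX is a placeholder,
      // not a real ticket number.
      @Ignore("Flaky; see HIVE-XXXXX, re-enable once fixed")
      @Test
      public void testSomethingFlaky() {
        // test body unchanged; it simply never runs while @Ignore is here
      }
    }

The reason string in @Ignore shows up with the skipped test in the report,
which should make it easy to audit what is still disabled when we come back
to re-enable things.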
> > > > On 5/11/18, 1:57 PM, "Vineet Garg" wrote:
> > > >
> > > > +1. I strongly vote for freezing commits and getting our test
> > > > coverage into an acceptable state. We have been struggling to
> > > > stabilize branch-3 due to test failures, and releasing Hive 3.0 in
> > > > its current state would be unacceptable.
> > > >
> > > > Currently there are quite a few test suites which are not even
> > > > running and are timing out. We have been committing patches (to
> > > > both branch-3 and master) without test coverage for these tests. We
> > > > should immediately figure out what's going on before we proceed
> > > > with commits.
> > > >
> > > > For reference, the following test suites are timing out on master
> > > > (https://issues.apache.org/jira/browse/HIVE-19506):
> > > >
> > > > TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
> > > > TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
> > > > TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
> > > > TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
> > > > TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
> > > > TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
> > > >
> > > > Vineet
> > > >
> > > > On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:
> > > >
> > > > +1 There are many problems with the test infrastructure, and in my
> > > > opinion it has now become the number one bottleneck for the
> > > > project. I was looking at the infrastructure yesterday and I think
> > > > the current infrastructure (even with its own set of problems) is
> > > > still under-utilized. To start with, I am planning to increase the
> > > > number of threads that process the parallel test batches. It needs
> > > > a restart on the server side. I can do it now, if folks are okay
> > > > with it; otherwise I can do it over the weekend when the queue is
> > > > small.
> > > >
> > > > I listed the improvements which I thought would be useful under
> > > > https://issues.apache.org/jira/browse/HIVE-19425, but frankly
> > > > speaking I am not able to devote as much time to it as I would
> > > > like. I would appreciate it if folks who have some more time could
> > > > help out.
> > > >
> > > > I think https://issues.apache.org/jira/browse/HIVE-19429 will help
> > > > a lot to start with. We need to pack more test runs in parallel,
> > > > and containers provide good isolation.
> > > >
> > > > For the short-term green runs, I think we should @Ignore the tests
> > > > which have been failing for many runs. They aren't being addressed
> > > > anyway. If people think they are important to be run, we should fix
> > > > them and only then re-enable them.
> > > >
> > > > Also, I feel we need a light-weight test run which we can execute
> > > > locally before submitting a patch for the full suite. That way
> > > > minor issues with the patch can be caught locally. Maybe we can
> > > > create a profile which runs a subset of important tests which are
> > > > consistent. We could apply some label indicating that the local
> > > > pre-checkin tests ran successfully, and only then submit for the
> > > > full suite.
> > > >
> > > > More thoughts are welcome. Thanks for starting this conversation.
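On the light-weight local profile idea above: one possible sketch (just an
illustration, nothing here exists in the codebase yet) is to tag the
stable, fast tests with a JUnit 4 category and let a dedicated Maven
profile select that category. FastTests and the test class below are
hypothetical names:

    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    // Hypothetical marker interface for the fast, consistent subset.
    interface FastTests {
    }

    public class TestStableExample {

      // Tagged so a local pre-checkin profile can run just this subset.
      @Category(FastTests.class)
      @Test
      public void testQuickSanityCheck() {
        // a cheap, deterministic check that is safe to gate commits on
      }
    }

Surefire can then select the subset via its groups parameter (pointing at
the fully qualified category class) under a pre-checkin profile, so the
default local run stays unchanged.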
> > > > On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> > > >
> > > > I believe we have reached a state (maybe we did reach it a while
> > > > ago) that is not sustainable anymore, as there are so many tests
> > > > failing / timing out that it is not possible to verify whether a
> > > > patch is breaking some critical parts of the system or not. It also
> > > > seems to me that due to the timeouts (maybe due to infra, maybe
> > > > not), ptest runs are taking even longer than usual, which in turn
> > > > creates an even longer queue of patches.
> > > >
> > > > There is an ongoing effort to improve ptest usability
> > > > (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> > > > that, we need to make an effort to stabilize the existing tests and
> > > > bring that failure count to zero.
> > > >
> > > > Hence, I am suggesting *we stop committing any patch before we get
> > > > a green run*. If someone thinks this proposal is too radical,
> > > > please come up with an alternative, because I do not think it is OK
> > > > to have the ptest runs in their current state. Other projects of a
> > > > certain size (e.g., Hadoop, Spark) are always green; we should be
> > > > able to do the same.
> > > >
> > > > Finally, once we get to zero failures, I suggest we be less
> > > > tolerant of committing without a clean ptest run. If there is a
> > > > failure, we need to fix it or revert the patch that caused it, and
> > > > then we continue developing.
> > > >
> > > > Please, let's all work together as a community to fix this issue;
> > > > that is the only way to get to zero quickly.
> > > >
> > > > Thanks,
> > > > Jesús
> > > >
> > > > PS. I assume the flaky tests will come into the discussion. Let's
> > > > see first how many of those we have, then we can work on finding a
> > > > fix.
> >
> > --
> > Best regards!
> > Rui Li