On Mon, Nov 7, 2016 at 3:20 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> On Mon, Nov 7, 2016 at 10:30 AM Raghav Kumar Gautam <rag...@apache.org>
> wrote:
>
> > Hi Ewen,
> >
> > Thanks for the feedback. Answers are inlined.
> >
> > On Sun, Nov 6, 2016 at 8:46 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > Yeah, I'm all for getting these to run more frequently and on lighter
> > > weight infrastructure. (By the way, I also saw the use of docker; I'd
> > > really like to get a "native" docker cluster type into ducktape at some
> > > point so all you have to do is bake the image and then spawn containers
> > > on demand.)
> >
> > I completely agree, supporting docker integration in ducktape would be
> > the ideal solution to the problem.
> >
> > > A few things. First, it'd be nice to know if we can chain these with
> > > normal PR builds or something like that. Even starting the system tests
> > > when we don't know the unit tests will pass seems like it'd be
> > > wasteful.
> >
> > If we do chaining, one problem it will bring is that the turnaround time
> > will suffer. It would take 1.5 hrs to run the unit tests and then another
> > 1.5 hrs to run the ducktape tests. Also, don't devs run the relevant unit
> > tests before they submit a patch?
>
> Yeah, I get that. Turnaround time will obviously suffer from serializing
> anything. The biggest problem today is that Jenkins builds are not as
> highly parallelized as most users' local runs, and the large number of
> integration tests that are baked into the unit tests means they take quite
> a long time. While the local runtime has been creeping up quite a bit
> recently, it's still under 15 min on a relatively recent MBP. Ideally we
> could just get the Jenkins builds to finish faster...

I investigated a little bit, and it seems that the unit tests are not
entirely stable, so it does not make sense to run them serially as of now.
For example: https://github.com/apache/kafka/pull/2107 (the unit tests were
passing after the 1st commit and failing after the 2nd commit, even though
the 2nd commit only had comment changes.)
https://github.com/apache/kafka/pull/2108
https://github.com/apache/kafka/pull/2099
https://github.com/apache/kafka/pull/2093

> > > Second, agreed on getting things stable before turning this on across
> > > the board.
> >
> > I have done some work on stabilizing the tests, but I need help from the
> > Kafka community to take this further. It would be great if someone could
> > guide me on how to do this. Should we start with a subset of tests that
> > are stable and enable others as we make progress? Who are the people I
> > can work with on this problem?
>
> It'll probably be a variety of people, because it depends on which
> components are unstable. For example, just among committers, different
> folks know different areas of the code (and especially the system tests)
> to different degrees. I can probably help across the board in terms of
> ducktape/system test stuff, but for any individual test you'll probably
> just want to git blame to figure out who might be best to ask for help.
>
> I can take a pass at this patch and see how much makes sense to commit
> immediately. If we don't immediately start getting feedback on failing
> tests and can instead make progress by requesting them manually on only
> some PRs or something like that, then that seems like it could be
> reasonable.
>
> My biggest concern, just taking a quick pass at the changes, is that we're
> doing a lot of renaming of tests just to split them up rather than by
> logical grouping. If we need to do this, it seems much better to add a
> small amount of tooling to ducktape to execute subsets of tests (e.g.
> split across N subsets of the tests). It requires more coordination
> between ducktape and getting this landed, but it feels like a much cleaner
> solution, and one that could eventually take advantage of additional
> information (e.g. if it knows the avg runtime from previous runs, it can
> divide tests based on that instead of only considering the # of tests).

I agree that the ideal solution would be to add support for this in
ducktape. But since this is going to be a big change, can we do this in a
separate JIRA?
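To make the splitting idea concrete, here is a minimal sketch of the kind
of deterministic assignment being discussed; the TC_TOTAL/TC_INDEX
environment variable names and the test paths are illustrative only, not
the actual POC code:

    import os

    def subset(tests, num_splits, index):
        # Sort first so every job computes the same assignment, then take
        # every num_splits-th test starting at this job's index.
        ordered = sorted(tests)
        return [t for i, t in enumerate(ordered) if i % num_splits == index]

    if __name__ == "__main__":
        # Each CI matrix entry would export its own index (0..TC_TOTAL-1).
        tests = ["kafkatest/tests/replication_test.py",
                 "kafkatest/tests/quota_test.py"]
        mine = subset(tests,
                      int(os.environ.get("TC_TOTAL", "1")),
                      int(os.environ.get("TC_INDEX", "0")))
        print("\n".join(mine))

Dividing by average runtime from previous runs, as suggested above, would
only change the assignment function (e.g. greedy bin-packing on recorded
durations); the job layout would stay the same.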
> > > Confluent runs these tests nightly on full VMs in AWS and
> > > historically, besides buggy logic in tests, underprovisioned resources
> > > tend to be the biggest source of flakiness in tests.
> >
> > Good to know that I am not the only one worrying about this problem :-)
> >
> > > Finally, should we be checking w/ infra and/or Travis folks before
> > > enabling something this expensive? Are the Storm integration tests of
> > > comparable cost? There are some in-flight patches for parallelizing
> > > test runs of ducktape tests (which also results in better
> > > utilization). But even with those changes, the full test run is still
> > > quite a few VM-hours per PR, and we only expect it to increase.
> >
> > We can ask the infra people about this, but I think this will not be a
> > problem. For example, Flink
> > <https://travis-ci.org/apache/flink/builds/173852382> is using 11 hrs of
> > computation time for each run. For Kafka we are going to start with 6
> > hrs. Also, with the docker setup we can bring up the whole 12-node
> > cluster on a laptop and run ducktape tests against it, so test
> > development cycles will become faster.
>
> Sure, it's just that over time this tends to lead to the current state of
> Jenkins, where it can take many hours before you get any feedback because
> things are so backed up.
>
> -Ewen
>
> > With Regards,
> > Raghav.
> >
> > > -Ewen
> > >
> > > On Thu, Nov 3, 2016 at 11:26 AM, Becket Qin <becket....@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the explanation, Raghav.
> > > >
> > > > If the workload is not a concern, then it is probably fine to run
> > > > tests for each PR update, although it may not be necessary :)
> > > >
> > > > On Thu, Nov 3, 2016 at 10:40 AM, Raghav Kumar Gautam <
> > > > rag...@apache.org> wrote:
> > > >
> > > > > Hi Becket,
> > > > >
> > > > > The tests would be run each time a PR is created/updated; this
> > > > > will look similar to https://github.com/apache/storm/pulls.
> > > > > Ducktape tests take about 7-8 hours to run on my laptop. For
> > > > > travis-ci we can split them into groups and run them in parallel.
> > > > > This was done in the POC run, which took 1.5 hrs using 10 splits
> > > > > with 5 jobs running in parallel.
> > > > > https://travis-ci.org/raghavgautam/kafka/builds/171502069
> > > > > For Apache projects the limit is 30 parallel jobs shared across
> > > > > all projects, so I expect it to take less time, but it also
> > > > > depends on the workload at the time.
> > > > > https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > > > >
> > > > > Thanks,
> > > > > Raghav.
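On the docker setup mentioned above: a "native" docker cluster type of the
kind Ewen describes could look roughly like the sketch below. DockerCluster,
the image name, and the container naming scheme are all hypothetical, not
ducktape's actual API; it only illustrates spawning one container per node
from a pre-baked image.

    import subprocess

    class DockerCluster(object):
        """Hypothetical cluster backend: one container per node, all from a
        pre-baked image, instead of provisioning full VMs."""

        def __init__(self, image, num_nodes):
            self.nodes = []
            for i in range(num_nodes):
                # Start a detached container; a real backend would also
                # wire up ssh/exec access for the test driver.
                cid = subprocess.check_output(
                    ["docker", "run", "-d", "--name", "knode%d" % i, image])
                self.nodes.append(cid.decode().strip())

        def shutdown(self):
            # Tear down every container the cluster started.
            for cid in self.nodes:
                subprocess.check_call(["docker", "rm", "-f", cid])

Bringing up the 12-node laptop cluster would then amount to something like
DockerCluster("kafkatest/node:latest", 12) (image name assumed) and
pointing ducktape at those containers.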
> > > > > On Thu, Nov 3, 2016 at 9:41 AM, Becket Qin <becket....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Raghav,
> > > > > >
> > > > > > +1 for the idea in general.
> > > > > >
> > > > > > One thing I am wondering is when the tests would be run. Would
> > > > > > they be run when we merge a PR, or every time a PR is
> > > > > > created/updated? I am not sure how long the tests in other
> > > > > > projects take. For Kafka it may take a few hours to run all the
> > > > > > ducktape tests; will that be an issue if we run the tests for
> > > > > > each update of the PR?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jiangjie (Becket) Qin
> > > > > >
> > > > > > On Thu, Nov 3, 2016 at 8:16 AM, Harsha Chintalapani
> > > > > > <ka...@harsha.io> wrote:
> > > > > >
> > > > > > > Thanks, Raghav. I am +1 for having this in Kafka. It will help
> > > > > > > identify any potential issues, especially with big patches.
> > > > > > > Given that we have some tests failing due to timing issues,
> > > > > > > can we disable the failing tests for now so that we don't get
> > > > > > > any false negatives?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Harsha
> > > > > > >
> > > > > > > On Tue, Nov 1, 2016 at 11:47 AM Raghav Kumar Gautam <
> > > > > > > rag...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I want to start a discussion about running ducktape tests
> > > > > > > > for each pull request. I have been working on KAFKA-4345
> > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-4345> to enable
> > > > > > > > this using docker on travis-ci.
> > > > > > > > Pull request: https://github.com/apache/kafka/pull/2064
> > > > > > > > Working POC:
> > > > > > > > https://travis-ci.org/raghavgautam/kafka/builds/171502069
> > > > > > > >
> > > > > > > > In the POC I am able to run 124/149 tests, out of which 88
> > > > > > > > pass. The failures are mostly timing issues. We can run the
> > > > > > > > same scripts on a laptop, where I am able to run 138/149
> > > > > > > > tests successfully.
> > > > > > > >
> > > > > > > > For this to work we need to enable travis-ci for Kafka. I
> > > > > > > > can open an infra bug to request travis-ci for this.
> > > > > > > > Travis-ci is already running tests for many Apache projects
> > > > > > > > like Storm, Hive, Flume, Thrift, etc.; see:
> > > > > > > > https://travis-ci.org/apache/.
> > > > > > > >
> > > > > > > > Does this sound interesting? Please comment.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Raghav.
> > >
> > > --
> > > Thanks,
> > > Ewen
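For reference, tallying pass/fail counts like the 124/149 figures above
from a ducktape run could be done with a small helper along these lines.
The results/latest/report.json path and the JSON layout assumed here are
illustrative only and may not match ducktape's actual output format:

    import json
    import sys

    def tally(report_path):
        # Assumed layout: {"results": [{"test_status": "PASS"|"FAIL", ...}]}
        # -- ducktape's real report schema may differ.
        with open(report_path) as f:
            results = json.load(f).get("results", [])
        passed = sum(1 for r in results if r.get("test_status") == "PASS")
        print("%d/%d tests passed" % (passed, len(results)))

    if __name__ == "__main__":
        tally(sys.argv[1] if len(sys.argv) > 1
              else "results/latest/report.json")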