Hi Remi,

I do not agree with the “There is no bigger problem” part of your reply, so I had to repeat myself to make it clearer, not because I am unaware of what this thread is supposed to do.

Regards,
Bharat.
On 28-Sep-2015, at 2:51 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:

> Hi Bharat,
>
> I understand your frustration, but we already agreed on this, so there is no need to repeat it. This thread is supposed to list some improvements and learn from them. Your point has been taken, so let's move on.
>
> We need documentation first, then make a change, after which all tests should pass. Even better is to write (missing) tests before changing stuff, so you know they pass before and after the fact.
>
> When doing reviews, feel free to ask for design documentation if you feel it is needed.
>
> Regards,
> Remi
>
> On 28/09/15 11:02, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>
>> Hi Remi,
>>
>> I never intended to say that we should not run tests, but even before tests we should have proper documentation. My concern is that if a major change is being introduced, it should be properly documented. The issues we are trying to fix are largely due to the VR refactor; if there had been proper documentation for it, we could have fixed them in a better way. Even to add tests, we need to understand how a particular thing works and what data it expects. I think this is where most people are facing issues while fixing the Python-based code, and proper documentation would help in understanding it better.
>>
>> Thanks,
>> Bharat.
>>
>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>
>>> Hi Bharat,
>>>
>>> There is no bigger problem. We should always run the tests, and if we find a case that isn't currently covered by them, we should simply add tests for it. There's no way we'll get a stable master without them. The fact that they may not cover everything is no reason not to rely on them. If a feature is not important enough to write a test for, then the feature is probably not important anyway. And if it is, then add a test :-)
>>>
>>> I do agree on the design documentation requirement for any (major?) change. I found some design documentation on the subject you mention, but it should have been more detailed.
>>>
>>> Regards,
>>> Remi
>>>
>>> On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>
>>>> Hi Remi,
>>>>
>>>> Thank you for the blameless postmortem.
>>>>
>>>> I think there is a bigger problem here than just the review process and running tests. Even if we run the tests, we cannot be sure that everything will work as intended. The tests only give some level of confidence; they may not cover all the use cases.
>>>>
>>>> I think the problem here is the way major changes to the code base are dealt with. For example, the VR refactoring was done without discussing the design implications and the amount of change it would bring. I could not find any design document. The VR refactor changed a lot of code and the way the VR used to work, and in my opinion it was incomplete: VPN, isolated networks, basic networks, iptables rules, RVR in the isolated case, etc. were not implemented. Most of us are still in the process of understanding it. Even before reaching this state, we had to spend a lot of time fixing the issues mentioned in the thread "[Blocker/Critical] VR related Issues".
>>>>
>>>> When a change of this magnitude is being made, we should call out all the changes and document them properly. This will help people to create better fixes.
>>>> Currently, when we attempt to fix the isolated VR case it affects the VPC case, and vice versa. For example, PR #738 fixed it for VPC networks but broke it for the isolated case. I believe it is not too late to at least start documenting the changes now.
>>>>
>>>> Thanks,
>>>> Bharat.
>>>>
>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
>>>>
>>>>> I have a concern here. Some of us are actively involved in reviewing the PRs related to marvin tests (enhancing existing tests / adding new tests). If we have to test a PR, it requires an environment to be created with actual resources, and this is going to take a lot of time. Some of the tests can run on the simulator, but most require real hardware. The PR submitter is already testing and submitting the test results along with the PR. So do reviewers also need to test these PRs?
>>>>>
>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
>>>>>
>>>>>> Remi, thanks for the detailed post-mortem, it's a good read and great learning. I hope everyone reads it.
>>>>>>
>>>>>> The one thing to emphasize is that we now have a very visible way to get code into master. We have folks investing time to provide reviews (great); we need the submitters to do due diligence and answer all comments in the reviews.
>>>>>>
>>>>>> In another project I work on, nothing can be added to the code without unit tests. I think we could go down the route of asking for new integration tests and unit tests for everything; if not, the PR does not get merged. But let's digest your post-mortem and we can discuss after 4.6.0.
>>>>>>
>>>>>> I see that you reverted one commit that was not made by you; that's great.
>>>>>>
>>>>>> Let's focus on the blockers now; everything else can wait.
>>>>>>
>>>>>> The big bonus of doing what we are doing is that once 4.6.0 is out, we can merge PRs again (assuming they are properly rebased and tested) and we can release 4.6.1 really quickly after.
>>>>>>
>>>>>> -sebastien
>>>>>>
>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> This mail is intended to be blameless. We need to learn something from it; that's why I left out who exactly did what, because it's not relevant. There are multiple examples, but it's about the why. Let's learn from this without blaming anyone.
>>>>>>>
>>>>>>> We know we need automated testing. We have integration tests, but we are unable to run all of them on every Pull Request we receive. If we had that in place, it'd be much easier to spot errors, regressions and so on. It'd also be more rewarding to write more tests.
>>>>>>>
>>>>>>> Unfortunately we're not there yet, so we need to do something else until we get there. If we do nothing, we know we'll have many issues, because a master that breaks on a regular basis is the most frustrating thing. We said we'd use Pull Requests with at least two humans reviewing and giving their OK for a Pull Request, in the form of an LGTM: Looks Good To Me. The LGTMs are there because we have no automated testing. Keep that in mind: as a reviewer, you are supposed to replace automated testing until it's there.
>>>>>>>
>>>>>>> Since we started doing this, master has become a lot more stable.
>>>>>>> But every now and then we still have issues. Let's look at how we do manual reviews. Again, this is not to blame anyone; it's to open our eyes and make us realise what we're doing and what results we get out of it.
>>>>>>>
>>>>>>> Example Pull Request #784:
>>>>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>>>>
>>>>>>> That's nice: it has a Jira id and a short description (as it should).
>>>>>>>
>>>>>>> The first person comes along and makes a comment: "There was also an issue with VPC VRs" ... "Have you seen this issue? Does your change affect the VPC VR (single/redundant)?"
>>>>>>>
>>>>>>> Actually a good question. Unfortunately, no answer comes. After a reminder, tests against VPC networks were promised. Great!
>>>>>>>
>>>>>>> The Jenkins builds both succeed, and Travis is green too. But how much value does that have? They give the impression of automated testing, and although you could argue they do some, it's far from complete. If a build breaks, you know you have an issue; but it doesn't work the other way around.
>>>>>>>
>>>>>>> Back to our example PR. In the meantime, another commit gets pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira issue, you see it is about redundant virtual routers, the non-VPC ones. So this is vague at best. Still, a reviewer gives an LGTM because he could create a VPC. That has nothing to do with the problem being fixed in this PR, nor with the comments made earlier. But at least the person said what he did, and we should all do that. What nobody knew back then was that this broke the default route on VPCs.
>>>>>>>
>>>>>>> Then something strange happens: the two commits from the PR end up on master as direct commits, with just one LGTM and no verification from the person commenting about the linked issue. This happened on Friday, September 11th.
>>>>>>>
>>>>>>> That day 21 commits came in, from 7 Pull Requests and unfortunately also from some direct commits. We noticed the direct commits and notified the list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot came in at the same time, it was decided not to revert them. Looking back, we should have.
>>>>>>>
>>>>>>> From this point on, VPCs were broken as they wouldn't get a default route: no public internet access from VMs in VPC tiers, no VPNs working, etc. This was mentioned to the list on Thursday September 15th, after some chats and debugging over the weekend (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
>>>>>>>
>>>>>>> Here we are: master is broken functionality-wise, and new Pull Requests come in to fix blockers. But we can never properly test them, because VPCs are broken in master and therefore also in the PRs branched off it, with or without the change in the PR.
>>>>>>>
>>>>>>> It starts to escalate as the days go by.
>>>>>>>
>>>>>>> I'll leave out the bit on how this frustrated people, although it's good to know we do not want to be in this situation.
>>>>>>>
>>>>>>> Eventually Wilder and I spent an evening and a day working on a branch where we loaded 7 PRs on top of each other (all VR related), only to find the VPC was still broken.
>>>>>>> It allowed us to zoom in and find that the default route was missing again. We knew it had worked 3 weeks before, because the same tests that succeeded then were now broken. We had already fixed this in PR #738 on August 25, so we were sure about it.
>>>>>>>
>>>>>>> After some digging we could trace it back to Pull Request #784. Imagine the feeling of seeing your own comment there, mentioning the previous issue with the default gateways. Fair to say our human review process clearly failed here. Many, many hours were spent on this problem over the past two weeks. Could we have prevented it? I think so, yes.
>>>>>>>
>>>>>>> This example clearly shows why:
>>>>>>>
>>>>>>> - we should use Pull Requests
>>>>>>> It made the change visible: great!
>>>>>>>
>>>>>>> - we do reviews and ask for feedback
>>>>>>> We got feedback and questions: also great!
>>>>>>>
>>>>>>> - we should always respond to feedback and verify it is resolved before merging
>>>>>>> We need to improve here. Even with two reviewers saying LGTM, we should still address any outstanding feedback before merging.
>>>>>>>
>>>>>>> - we should have two humans doing a review
>>>>>>> We need to improve here as well. Not one reviewer; we need two. Really.
>>>>>>>
>>>>>>> - we need to document why we say LGTM
>>>>>>> Another improvement. It's nice to say LGTM, but a review of only 4 characters and nothing more is useless. We need to know what was tested and how: test results, screenshots or anything else that shows what's been verified. If you only reviewed the code, that's also fine, but at least say so; then the next reviewer should do another type of review to get the complete picture. Remember, you're replacing automated testing!
>>>>>>>
>>>>>>> - we should always merge Pull Requests
>>>>>>> We made it easy, merging is the de facto standard, and it has even more benefits. You can trace commits back to their Pull Request (and find all comments and discussion there: saves time, trust me). It also allows for easier reverting of a Pull Request. We'll see even more benefits once 4.7 is there. Although the intention to merge the Pull Request was there, it still didn't happen. We should always check before we push; as committers we just need to be sure.
>>>>>>>
>>>>>>> - we need automated testing!
>>>>>>> The sooner the better. It's all about the missing automated testing. After 4.6, we all need to focus on this. It saves a lot of time. And frustration.
>>>>>>>
>>>>>>> We're doing final testing on PR #887 and will merge it soon. From that point on we can look into new issues. Be aware that any PR created after September 10 needs to be rebased on current master (once #887 is merged). Without that, no serious testing can be done.
>>>>>>>
>>>>>>> Let's be careful about what lands on master. I'll only be merging Pull Requests that have had proper reviews with information on what was tested. At least one reviewer needs to actually verify that it works (and show the rest of us). We simply cannot assume it will work.
>>>>>>>
>>>>>>> If we do this, I think we can start resolving the remaining blockers one by one and go into the first RC round. Please help out where you can, so we can make this a success together. Thanks!
>>>>>>>
>>>>>>> Looking forward to the day we have our automated testing in place ;-)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Remi
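
The failure at the heart of this postmortem, a VPC VR silently losing its default route, is exactly the kind of regression a small automated check can catch. Below is a minimal sketch of such a check in Python, the language of the VR code and the marvin tests. It is an illustration, not CloudStack's actual test API: the helper names, the SSH access path and the expected-gateway parameter are assumptions; only the parsing of "ip route show" output is concrete.

# Minimal sketch of an automated default-route check for a (VPC) VR.
# Hypothetical helpers: how you reach the router (SSH from the host,
# marvin utilities, etc.) is environment-specific; only the parsing of
# `ip route show` output is concrete.

import subprocess


def default_gateway(route_output):
    """Return the gateway of the default route in `ip route show` output, or None."""
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "default" and fields[1] == "via":
            return fields[2]
    return None


def check_vr_default_route(vr_ip, expected_gw):
    """SSH into the router (hypothetical access path) and assert its default route."""
    out = subprocess.check_output(
        ["ssh", "root@" + vr_ip, "ip route show"]
    ).decode()
    gw = default_gateway(out)
    assert gw == expected_gw, \
        "VR %s has default gateway %r, expected %r" % (vr_ip, gw, expected_gw)


if __name__ == "__main__":
    # Offline self-test with canned output, so the parsing logic is verifiable
    sample = "default via 10.1.1.1 dev eth1\n10.1.1.0/24 dev eth1 scope link\n"
    assert default_gateway(sample) == "10.1.1.1"
    print("default-route parsing OK")

Run against every VR after each merge, a check along these lines would likely have flagged the regression from PR #784 the day it landed, instead of two weeks later.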