Hi Wilder, I am not talking about just the vpc networks. There are many other ares getting effected because of this, some of them are vpn(not implemented) , rvr in isolated networks etc. All i am saying is the design doc will help us understand the complete impact of the changes and deal with them accordingly.
Regards, Bharat. On 28-Sep-2015, at 6:02 pm, Wilder Rodrigues <wrodrig...@schubergphilis.com> wrote: > Only few tests…. 51 tests against a real environment. > > At that time Nux also tested it and we tried to get Paul Angus, Geoff and > Rohit from Shape Blue to test it as well. Nux found a couple of issues that > were reported and fixed (see email below). > > When I came back from holidays, 4 weeks ago, a PR containing 360 files > changed and almost 4000 lines, which was not even compiling, was merged onto > Master. We have less than a handful of people executing tests against PRs - > so few that I could even name who tests and who doesn’t. But hey, that’s a > blames email. I’m not trying to justify anything, but that handful of people, > who actually care about ACS, are getting quite fedup with this whole > discussion. > > Cheers, > Wilder > > =========================================================== > > On 20 Feb 2015, at 10:03, Nux! <n...@li.nux.ro<mailto:n...@li.nux.ro>> wrote: > > Well, it looks like we were right to test it, found some problems. > > 1 - the passwords for instances are not served properly, `wget --header > "DomU_Request: send_my_password" $router:8080` returns blank response. I am > not sure why this happens, though I tried to look around the router. > > 2 - in addition to the above, in a redundant VPC the SNAT does not work. > >From an instance I can ping the router(s), but not any further than that. > SNAT works fine in a normal/non-vpc network. > I'll try to look more into it later today. > > Have a nice day :) > > Lucian > > -- > Sent from the Delta quadrant using Borg technology! > > Nux! > www.nux.ro<http://www.nux.ro> > > > > > On 28 Sep 2015, at 14:13, Bharat Kumar > <bharat.ku...@citrix.com<mailto:bharat.ku...@citrix.com>> wrote: > > Hi Sebastien, > > You are confused, we are talking about persistent VR config changes. below > is the pr related to it. > https://github.com/apache/cloudstack/pull/118 > > If you look at it you will notice that there are more than 250 commits and > only a few tests that were run. > > Regards, > Bharat. > > On 28-Sep-2015, at 5:24 pm, Bharat Kumar > <bharat.ku...@citrix.com<mailto:bharat.ku...@citrix.com><mailto:bharat.ku...@citrix.com>> > wrote: > > Hi Remi, > > Whatever ever we think we have discovered are all well known best practices > while developing code in community. > I agree that tests need to be run on a new PR, but i wonder why was this > ignored when merging the VR refactor code. Perhaps we will uncover some more > issues if we investigate this. I believe > one of the reasons for this is the complexity and incomplete nature of the vr > refactor change and failing to identify the areas which got effected. If we > had a good documentation i think we cloud have understood the areas that were > getting > impacted early on, areas like the vpn , iptables, isolated networks rvr > etc and run the relevant tests. The documentation will also help us focus on > these areas while reviewing and fixing subsequent issues. Currently no one > knows the areas that got effected > due to the vr refactor change, we are seeing issues all over the code. I > think this is a bigger problem than what we have discussed so far. > > I think presently we should stop fixing all the vr refactoring bugs until we > come up with a proper document describing the VR refactoring changes. > > I am not suggesting that we should revert the vr refactor code, I am willing > to work on this and fix the issues, I am only asking if we can get some > documentation. > > > Regards, > Bharat. > > On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen > <run...@gmail.com<mailto:run...@gmail.com><mailto:run...@gmail.com>> wrote: > > > On Sep 28, 2015, at 1:14 PM, Remi Bergsma > <rberg...@schubergphilis.com<mailto:rberg...@schubergphilis.com><mailto:rberg...@schubergphilis.com>> > wrote: > > Hi Bharat, > > > There is only one way to prove a feature works: with tests. That’s why I say > actually _running_ the tests we have today on any new PR, is the most > important thing. Having no documentation is a problem, I agree, but it is not > more important IMHO. If we had the documentation, we still would have issues > if nobody runs the tests and verifies pull requests. Documentation that is > perfect does not automatically lead to perfect implementation. So we need > tests to verify. > > If we don’t agree that is also fine. We need to do both anyway and I think we > do agree on that. > > > Also we need to move forward. We should have a live chat once 4.6 is out to > discuss all issues/problems and iron out the process. > > But reverting the VR refactor is not going to happen. There was ample > discussions on the PR when it was submitted, there was time to review and > raise concerns at that time. It went through quite a few reviews, tests > etc…Maybe the documentation is not good, but the time to raise this concern I > am afraid was six months ago. We can learn from it, but we are not going to > revert it, this would not go cleanly as David mentioned. > > So let’s get back to blockers for 4.6, are there still VR related issues with > master ? > > > > > Regards, > Remi > > > > > > > On 28/09/15 12:15, "Bharat Kumar" > <bharat.ku...@citrix.com<mailto:bharat.ku...@citrix.com><mailto:bharat.ku...@citrix.com>> > wrote: > > Hi Remi, > > i do not agree with “There is no bigger problem” part of your reply. so I > had to repeat myself to make it more clear, Not because i am not aware of > what this thread is supposed to do. > > Regards, > Bharat. > > On 28-Sep-2015, at 2:51 pm, Remi Bergsma > <rberg...@schubergphilis.com<mailto:rberg...@schubergphilis.com><mailto:rberg...@schubergphilis.com>> > wrote: > > Hi Bharat, > > I understand your frustrations but we already agreed on this so no need to > repeat. This thread is supposed to list some improvements and learn from it. > Your point has been taken so let’s move on. > > We need documentation first, then do a change after which all tests should > pass. Even better is we write (missing) tests before changing stuff so you > know they pass before and after the fact. > > When doing reviews, feel free to ask for design documentation if you feel it > is needed. > > Regards, Remi > > > > On 28/09/15 11:02, "Bharat Kumar" > <bharat.ku...@citrix.com<mailto:bharat.ku...@citrix.com><mailto:bharat.ku...@citrix.com>> > wrote: > > Hi Remi, > > I never intended to say that we should not run tests, but even before tests > we should have proper documentation. My concern was if a major change is > being introduced it should be properly documented. All the issues which we > are trying to fix are majorly due to VR refactor. If there was a proper > documentation for this we could > have fixed this in a better way. Even to add tests we need to understand how > a particular thing works and what data dose it expect. I think while fixing > the python based code changes this is where most of the people are facing > issues. A proper documentation will help in understanding these in a better > way. > > Thanks, > Bharat. > > On 28-Sep-2015, at 1:57 pm, Remi Bergsma > <rberg...@schubergphilis.com<mailto:rberg...@schubergphilis.com><mailto:rberg...@schubergphilis.com>> > wrote: > > Hi Bharat, > > There is no bigger problem. We should always run the tests and if we find a > case that isn’t currently covered by the tests we should simply add tests for > it. There’s no way we’ll get a stable master without them. The fact that they > may not cover everything, is no reason to not rely on them. If a feature is > not important enough to write a test for, then the feature is probably not > important anyway. And if it is, then add a test :-) > > I do agree on the design documentation requirement for any (major?) change. I > found some design documentations on the subject you mention, but it should > have been more detailed. > > Regards, > Remi > > > > > > > On 28/09/15 09:58, "Bharat Kumar" > <bharat.ku...@citrix.com<mailto:bharat.ku...@citrix.com><mailto:bharat.ku...@citrix.com>> > wrote: > > Hi Remi, > > Thank you for the Blame less postmortem. > > I think there is a bigger problem here than just the review process and > running tests. Even if we run the tests we cannot be sure that every thing > will work as intended. The tests will only give some level of confidence. The > tests may not cover all the use cases. > > I think the problem here is that the way major changes to the code base are > dealt with. For example, VR refactoring was done without discussing the > design implications and the amount of changes it would bring in. I could not > find any design document. The vr refactor changed a lot of code and the way > VR used to work and in my opinion it was incomplete-vpn, isolated networks, > basic networks, iptable rules and rvr in isolated case etc were not > implemented. Most of us are still in the process of understanding this. Even > before reaching this state we had to spend a lot of time fixing issues > mentioned in the thread [Blocker/Critical] VR related Issues. > > When a change of this magnitude is being made, we should call out all the > changes and document them properly. This will help people to create better > fixes. Currently when we attempt to fix the isolated vr case it is effecting > the vpc and vice versa. for example pr 738 fixed it for vpc networks but > broke it for isolated case. I believe it is not too late to at least start > documenting the changes now. > > Thanks, > Bharat. > > On 28-Sep-2015, at 10:52 am, Sanjeev N > <sanj...@apache.org<mailto:sanj...@apache.org><mailto:sanj...@apache.org>> > wrote: > > I have a concern here. Some of us are actively involved in reviewing the > PRs related to marvin tests(Enhancing existing tests/Adding new tests). If > we have to test a PR it requires an environment to be created with actual > resources and this is going to take lot of time. Some of the tests can run > on simulator but most of the tests require real hardware to test. PR > submitter is already testing and submitting the test results along with the > PR. So is it require to test these PRs by reviewers? > > On Sat, Sep 26, 2015 at 1:49 PM, sebgoa > <run...@gmail.com<mailto:run...@gmail.com><mailto:run...@gmail.com>> wrote: > > Remi, thanks for the detailed post-mortem, it's a good read and great > learning. > I hope everyone reads it. > > The one thing to emphasize is that we now have a very visible way to get > code into master, we have folks investing time to provide review (great), > we need the submitters to make due diligence and answer all comments in the > reviews. > > In another project i work on, nothing can be added to the code without > unit tests. I think we could go down the route of asking for new > integration tests and unit tests for anything. If not, the PR does not get > merged. But let's digest your post-mortem and we can discuss after 4.6.0. > > I see that you reverted one commit that was not made by you, that's great. > > Let's focus on the blockers now, everything else can wait. > > The big bonus of doing what we are doing is that once 4.6.0 is out, we can > merge PRs again (assuming they are properly rebased and tested) and we can > release 4.6.1 really quickly after. > > -sebastien > > On Sep 25, 2015, at 9:51 PM, Remi Bergsma > <rberg...@schubergphilis.com<mailto:rberg...@schubergphilis.com><mailto:rberg...@schubergphilis.com>> > wrote: > > Hi all, > > This mail is intended to be blameless. We need to learn something from > it. That's why I left out who exactly did what because it’s not relevant. > There are multiple examples but it's about the why. Let's learn from this > without blaming anyone. > > We know we need automated testing. We have integration tests, but we are > unable to run all of them on any Pull Request we receive. If we would have > that in place, it'd be much easier to spot errors, regression and so on. > It'd also be more rewarding to write more tests. > > Unfortunately we're not there yet. So, we need to do something else > instead until we get there. If we do nothing, we know we have many issues > because a master that breaks on a regular basis is the most frustrating > things. We said we'd use Pull Requests with at least two humans to review > and give their OK for a Pull Request. In the form of LGTM: Looks Good To > Me. Ok, so the LGTMs are there because we have no automated testing. Keep > that in mind. You are supposed to replace automated testing until it's > there. > > Since we do this, master got a lot more stable. But every now and then > we still have issues. Let's look at how we do manual reviews. Again, this > is not to blame anyone. It's to open our eyes and make us realise what > we're doing and what results we get out of that. > > > Example Pull Request #784: > Title: CLOUDSTACK-8799 fixed the default routes > > That's nice, it has a Jira id and a short description (as it should be). > > The first person comes along and makes a comment: > "There was also an issue with VPC VRs" ... "Have you seen this issue? > Does your change affects the VPC VR (single/redundant)?" > > Actually a good question. Unfortunaly there comes no answer. After a > reminder, it was promised to do tests against VPC networks. Great! > > The Jenkins builds both succeed and also Travis is green. But how much > value does this have? They have the impression to do automated testing, and > although you could argue they do, it's far from complete. If it breaks, you > know you have an issue. But it doesn’t work the other way around. > > Back to our example PR. In the mean time, another commit gets pushed to > it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira > issue, you see it is about redundant virtual routers. The non-VPC ones. So > this is vague at best. But a reviewer gives a LGTM because the person could > create a VPC. That doesn't have anything to do with the problem being fixed > in this PR nor with the comments made earlier. But, at least the person > said what he did and we should all do that. What nobody knew back then, was > that this broke the default route on VPCs. > > Then something strange happens: the two commits from the PR end up on > master as direct commits. With just one LGTM and no verification from the > person commenting about the linked issue. This happened on Friday September > 11th. > > That day 21 commits came in, from 7 Pull Request and unfortunately also > from some direct commits. We noticed the direct commits and notified the > list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot > came in at the same time, it was decided not to revert them. Looking back, > we should have done it. > > From this point on, VPCs were broken as they wouldn't get a default > route. So, no public internet access from VMs in VPC tiers, no VPNs > working, etc. This was mentioned to the list on Thursday September 15th, > after some chats and debugging going on over the weekend ( > http://cloudstack.markmail.org/message/73ulpu4p75ex24tc) > > Here we are, master is broken functionality wise and new Pull Requests > come in to fix blockers. But we cannot ever test their proper working, > because VPCs are broken in master and so also in the PRs branched off of > it. With or without change in the PR. > > It starts to escalate as the days go by. > > I’ll leave out the bit on how this frustrated people. Although it’s good > to know we do not want to be in this situation. > > Eventually Wilder and I spent an evening and a day working on a branch > where we loaded 7 PRs on top of each other (all VR related) only to find > the VPC is still broken. It allowed us to zoom in and find the default > route was missing again. We said it worked 3 weeks before, because the same > tests that succeeded then, now were broken. We had already fixed this in PR > #738 on August 25 so were sure about it. > > After some digging we could trace it back to Pull Request #784. Imagine > the feeling seeing your own comment there mentioning the previous issue on > the default gateways. Fair to say our human review process clearly failed > here. Many many hours were spent on this problem over the past two weeks. > Could we have prevented this from happening? I think so, yes. > > > This example clearly shows why: > > - we should use Pull Requests > It made the change visible: Great! > > - we do reviews and ask for feedback > We got feedback and questions: Also great! > > - we should always respond to feedback and verify it is resolved, before > merging > We need to improve here. Even with two reviewers that say LGTM, we > should still address any feedback before merging. > > - we should have two humans doing a review > We need to improve here as well. Not one reviewer, we need two. Really. > > - we need to document why we say LGTM. > Another improvement. It’s nice to say LGTM, but a review of only 4 > characters and nothing more is useless. We need to know what was tested and > how. Test results, screen shots or anything that shows what's been > verified. If you only reviewed the code, also fine but at least say that. > Then the next reviewer should do another type of review to get the comlete > picture. Remember you're replacing automated testing! > > - we should always merge Pull Requests > We made it easy, merging is the de facto standard, and it has even more > benefits. You can trace commits back to their Pull Request (and find all > comments and discussion there: saves time, trust me). It also allows for > easier reverting of a Pull Request. We’ll see even more benefits once 4.7 > is there. Although the intentions to merge the Pull Request were there, it > still didn't happen. We should always check before we push. As a committer > we just need to be sure. > > - we need automated testing! > The sooner the better. It’s all about the missing automated testing. > After 4.6, we all need to focus on this. Saves a lot of time. And > frustrations. > > > > We're doing final testing on PR #887 and will merge it soon. From that > point on we can look into new issues. Be aware that any PR out there that > was created after September 10 needs to be rebased with current master > (when #887 is merged). Without that, no serious testing can be done. > > Let's be careful what to land on master. I'll only be merging Pull > Requests that have had proper reviews with information on what was tested. > At least one reviewer needs to actually verify it works (and show the rest > of us). We simply cannot assume it will work. > > If we do this, I think we can start resolving the remaining blockers > one-by-one and go into the first RC round. Please help out where you can so > we can make this a success together. Thanks! > > Looking forward to the day we have our automated testing in place ;-) > > Regards, > Remi >