Hi Remi,

I do not agree with the “There is no bigger problem” part of your reply, so I had to repeat myself to make it clearer, not because I am unaware of what this thread is supposed to do.

Regards,
Bharat.
On 28-Sep-2015, at 2:51 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:

> Hi Bharat,
>
> I understand your frustration, but we already agreed on this, so there is no need to repeat it. This thread is supposed to list some improvements and learn from them. Your point has been taken, so let's move on.
>
> We need documentation first, then make a change, after which all tests should pass. Even better is to write (missing) tests before changing stuff, so you know they pass before and after the fact.
>
> When doing reviews, feel free to ask for design documentation if you feel it is needed.
>
> Regards,
> Remi
>
> On 28/09/15 11:02, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>
>> Hi Remi,
>>
>> I never intended to say that we should not run tests, but even before tests we should have proper documentation. My concern is that if a major change is being introduced, it should be properly documented. The issues we are trying to fix are largely due to the VR refactor; if there had been proper documentation for it, we could have fixed them in a better way. Even to add tests, we need to understand how a particular thing works and what data it expects. I think this is where most people are facing issues while fixing the Python-based code, and proper documentation would help in understanding it better.
>>
>> Thanks,
>> Bharat.
>>
>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>
>>> Hi Bharat,
>>>
>>> There is no bigger problem. We should always run the tests, and if we find a case that isn't currently covered by them, we should simply add tests for it. There's no way we'll get a stable master without them. The fact that they may not cover everything is no reason not to rely on them. If a feature is not important enough to write a test for, then the feature is probably not important anyway. And if it is, then add a test :-)
>>>
>>> I do agree on the design documentation requirement for any (major?) change. I found some design documentation on the subject you mention, but it should have been more detailed.
>>>
>>> Regards,
>>> Remi
>>>
>>> On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>
>>>> Hi Remi,
>>>>
>>>> Thank you for the blameless postmortem.
>>>>
>>>> I think there is a bigger problem here than just the review process and running tests. Even if we run the tests, we cannot be sure that everything will work as intended. The tests only give some level of confidence; they may not cover all the use cases.
>>>>
>>>> I think the problem here is the way major changes to the code base are dealt with. For example, the VR refactoring was done without discussing the design implications and the amount of change it would bring. I could not find any design document. The VR refactor changed a lot of code and the way the VR used to work, and in my opinion it was incomplete: VPN, isolated networks, basic networks, iptables rules, RVR in the isolated case, etc. were not implemented. Most of us are still in the process of understanding it. Even before reaching this state, we had to spend a lot of time fixing the issues mentioned in the thread "[Blocker/Critical] VR related Issues".
>>>>
>>>> When a change of this magnitude is being made, we should call out all the changes and document them properly. This will help people to create better fixes.
>>>> Currently, when we attempt to fix the isolated VR case it affects the VPC case, and vice versa. For example, PR #738 fixed it for VPC networks but broke it for the isolated case. I believe it is not too late to at least start documenting the changes now.
>>>>
>>>> Thanks,
>>>> Bharat.
>>>>
>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
>>>>
>>>>> I have a concern here. Some of us are actively involved in reviewing the PRs related to marvin tests (enhancing existing tests / adding new tests). If we have to test a PR, it requires an environment to be created with actual resources, and this is going to take a lot of time. Some of the tests can run on the simulator, but most require real hardware. The PR submitter is already testing and submitting the test results along with the PR. So do reviewers also need to test these PRs?
>>>>>
>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
>>>>>
>>>>>> Remi, thanks for the detailed post-mortem, it's a good read and great learning. I hope everyone reads it.
>>>>>>
>>>>>> The one thing to emphasize is that we now have a very visible way to get code into master. We have folks investing time to provide reviews (great); we need the submitters to do due diligence and answer all comments in the reviews.
>>>>>>
>>>>>> In another project I work on, nothing can be added to the code without unit tests. I think we could go down the route of asking for new integration tests and unit tests for everything; if not, the PR does not get merged. But let's digest your post-mortem and we can discuss after 4.6.0.
>>>>>>
>>>>>> I see that you reverted one commit that was not made by you; that's great.
>>>>>>
>>>>>> Let's focus on the blockers now; everything else can wait.
>>>>>>
>>>>>> The big bonus of doing what we are doing is that once 4.6.0 is out, we can merge PRs again (assuming they are properly rebased and tested) and we can release 4.6.1 really quickly after.
>>>>>>
>>>>>> -sebastien
>>>>>>
>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> This mail is intended to be blameless. We need to learn something from it; that's why I left out who exactly did what, because it's not relevant. There are multiple examples, but it's about the why. Let's learn from this without blaming anyone.
>>>>>>>
>>>>>>> We know we need automated testing. We have integration tests, but we are unable to run all of them on every Pull Request we receive. If we had that in place, it'd be much easier to spot errors, regressions and so on. It'd also be more rewarding to write more tests.
>>>>>>>
>>>>>>> Unfortunately we're not there yet, so we need to do something else until we get there. If we do nothing, we know we'll have many issues, because a master that breaks on a regular basis is the most frustrating thing. We said we'd use Pull Requests with at least two humans reviewing and giving their OK for a Pull Request, in the form of an LGTM: Looks Good To Me. The LGTMs are there because we have no automated testing. Keep that in mind: as a reviewer, you are supposed to replace automated testing until it's there.
>>>>>>>
>>>>>>> Since we started doing this, master has become a lot more stable.
>>>>>>> But every now and then we still have issues. Let's look at how we do manual reviews. Again, this is not to blame anyone; it's to open our eyes and make us realise what we're doing and what results we get out of it.
>>>>>>>
>>>>>>> Example Pull Request #784:
>>>>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>>>>
>>>>>>> That's nice: it has a Jira id and a short description (as it should).
>>>>>>>
>>>>>>> The first person comes along and makes a comment: "There was also an issue with VPC VRs" ... "Have you seen this issue? Does your change affect the VPC VR (single/redundant)?"
>>>>>>>
>>>>>>> Actually a good question. Unfortunately, no answer comes. After a reminder, tests against VPC networks were promised. Great!
>>>>>>>
>>>>>>> The Jenkins builds both succeed, and Travis is green too. But how much value does that have? They give the impression of automated testing, and although you could argue they do some, it's far from complete. If a build breaks, you know you have an issue; but it doesn't work the other way around.
>>>>>>>
>>>>>>> Back to our example PR. In the meantime, another commit gets pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira issue, you see it is about redundant virtual routers, the non-VPC ones. So this is vague at best. Still, a reviewer gives an LGTM because he could create a VPC. That has nothing to do with the problem being fixed in this PR, nor with the comments made earlier. But at least the person said what he did, and we should all do that. What nobody knew back then was that this broke the default route on VPCs.
>>>>>>>
>>>>>>> Then something strange happens: the two commits from the PR end up on master as direct commits, with just one LGTM and no verification from the person commenting about the linked issue. This happened on Friday, September 11th.
>>>>>>>
>>>>>>> That day 21 commits came in, from 7 Pull Requests and unfortunately also from some direct commits. We noticed the direct commits and notified the list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot came in at the same time, it was decided not to revert them. Looking back, we should have.
>>>>>>>
>>>>>>> From this point on, VPCs were broken as they wouldn't get a default route: no public internet access from VMs in VPC tiers, no VPNs working, etc. This was mentioned to the list on Thursday September 15th, after some chats and debugging over the weekend (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
>>>>>>>
>>>>>>> Here we are: master is broken functionality-wise, and new Pull Requests come in to fix blockers. But we can never properly test them, because VPCs are broken in master and therefore also in the PRs branched off it, with or without the change in the PR.
>>>>>>>
>>>>>>> It starts to escalate as the days go by.
>>>>>>>
>>>>>>> I'll leave out the bit on how this frustrated people, although it's good to know we do not want to be in this situation.
>>>>>>>
>>>>>>> Eventually Wilder and I spent an evening and a day working on a branch where we loaded 7 PRs on top of each other (all VR related), only to find the VPC was still broken.
>>>>>>> It allowed us to zoom in and find that the default route was missing again. We knew it had worked 3 weeks before, because the same tests that succeeded then were now broken. We had already fixed this in PR #738 on August 25, so we were sure about it.
>>>>>>>
>>>>>>> After some digging we could trace it back to Pull Request #784. Imagine the feeling of seeing your own comment there, mentioning the previous issue with the default gateways. Fair to say our human review process clearly failed here. Many, many hours were spent on this problem over the past two weeks. Could we have prevented it? I think so, yes.
>>>>>>>
>>>>>>> This example clearly shows why:
>>>>>>>
>>>>>>> - we should use Pull Requests
>>>>>>> It made the change visible: great!
>>>>>>>
>>>>>>> - we do reviews and ask for feedback
>>>>>>> We got feedback and questions: also great!
>>>>>>>
>>>>>>> - we should always respond to feedback and verify it is resolved before merging
>>>>>>> We need to improve here. Even with two reviewers saying LGTM, we should still address any outstanding feedback before merging.
>>>>>>>
>>>>>>> - we should have two humans doing a review
>>>>>>> We need to improve here as well. Not one reviewer; we need two. Really.
>>>>>>>
>>>>>>> - we need to document why we say LGTM
>>>>>>> Another improvement. It's nice to say LGTM, but a review of only 4 characters and nothing more is useless. We need to know what was tested and how: test results, screenshots or anything else that shows what's been verified. If you only reviewed the code, that's also fine, but at least say so; then the next reviewer should do another type of review to get the complete picture. Remember, you're replacing automated testing!
>>>>>>>
>>>>>>> - we should always merge Pull Requests
>>>>>>> We made it easy, merging is the de facto standard, and it has even more benefits. You can trace commits back to their Pull Request (and find all comments and discussion there: saves time, trust me). It also allows for easier reverting of a Pull Request. We'll see even more benefits once 4.7 is there. Although the intention to merge the Pull Request was there, it still didn't happen. We should always check before we push; as committers we just need to be sure.
>>>>>>>
>>>>>>> - we need automated testing!
>>>>>>> The sooner the better. It's all about the missing automated testing. After 4.6, we all need to focus on this. It saves a lot of time. And frustration.
>>>>>>>
>>>>>>> We're doing final testing on PR #887 and will merge it soon. From that point on we can look into new issues. Be aware that any PR created after September 10 needs to be rebased on current master (once #887 is merged). Without that, no serious testing can be done.
>>>>>>>
>>>>>>> Let's be careful about what lands on master. I'll only be merging Pull Requests that have had proper reviews with information on what was tested. At least one reviewer needs to actually verify that it works (and show the rest of us). We simply cannot assume it will work.
>>>>>>>
>>>>>>> If we do this, I think we can start resolving the remaining blockers one by one and go into the first RC round. Please help out where you can, so we can make this a success together. Thanks!
>>>>>>>
>>>>>>> Looking forward to the day we have our automated testing in place ;-)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Remi
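
The failure at the heart of this postmortem, a VPC VR silently losing its default route, is exactly the kind of regression a small automated check can catch. Below is a minimal sketch of such a check in Python, the language of the VR code and the marvin tests. It is an illustration, not CloudStack's actual test API: the helper names, the SSH access path and the expected-gateway parameter are assumptions; only the parsing of "ip route show" output is concrete.

# Minimal sketch of an automated default-route check for a (VPC) VR.
# Hypothetical helpers: how you reach the router (SSH from the host,
# marvin utilities, etc.) is environment-specific; only the parsing of
# `ip route show` output is concrete.

import subprocess


def default_gateway(route_output):
    """Return the gateway of the default route in `ip route show` output, or None."""
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "default" and fields[1] == "via":
            return fields[2]
    return None


def check_vr_default_route(vr_ip, expected_gw):
    """SSH into the router (hypothetical access path) and assert its default route."""
    out = subprocess.check_output(
        ["ssh", "root@" + vr_ip, "ip route show"]
    ).decode()
    gw = default_gateway(out)
    assert gw == expected_gw, \
        "VR %s has default gateway %r, expected %r" % (vr_ip, gw, expected_gw)


if __name__ == "__main__":
    # Offline self-test with canned output, so the parsing logic is verifiable
    sample = "default via 10.1.1.1 dev eth1\n10.1.1.0/24 dev eth1 scope link\n"
    assert default_gateway(sample) == "10.1.1.1"
    print("default-route parsing OK")

Run against every VR after each merge, a check along these lines would likely have flagged the regression from PR #784 the day it landed, instead of two weeks later.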