On 12/18/2013 10:54 PM, Jay Pipes wrote:
> On 12/18/2013 10:21 PM, Brent Eagles wrote:
>> Hi,
>>
>> Yair and I were discussing a change that I initiated and that was incorporated into the test_network_basic_ops test. It was intended as a configuration control point for floating IP address assignments before actually testing connectivity. The question we were discussing was whether this check is a valid pass/fail criterion for tests like test_network_basic_ops.
>>
>> The initial motivation for the change was that test_network_basic_ops had a less than 50/50 chance of passing in my local environment, for whatever reason. After looking at the test, it seemed ridiculous that it should be failing. The problem was that, more often than not, the data available in the logs all pointed to everything being set up correctly, yet the ping test for connectivity was timing out. From the logs it wasn't clear whether the test was failing because neutron did not do the right thing, did not do it fast enough, or because something else was happening. Of course, if I paused the test for a short bit between setup and the checks to manually verify everything, the checks always passed. So it's a timing issue, right?
>>
>> Two things: adding more timeout to a check is as appealing to me as gargling glass, AND I was less "annoyed" that the test was failing than that it wasn't clear from reading the logs what had gone wrong. I tried to find an additional intermediate control point that would split the failure modes into two categories: neutron is too slow in setting things up, and neutron failed to set things up correctly. Granted, it still adds timeout to the test, but with a control point based on "settling", if that control point passed there would be a good chance that a subsequent check failure meant neutron had actually screwed up what it was trying to do.
>>
>> Waiting until the floating IP information appears in the nova query results seemed a relatively reasonable, if imperfect, "settling" criterion before attempting to connect to the VM. Testing whether the floating IP assignment makes it into the nova instance details is a valid test and, AFAICT, missing from the current tests. However, Yair has the reasonable point that connectivity is often available long before the floating IP appears in the nova results, and that it could be considered invalid to use non-network-specific criteria as pass/fail for this test.
>
> But, Tempest is all about functional integration testing. Using a call to Nova's server details to determine whether a dependent call to Neutron succeeded (setting up the floating IP) is exactly what I think Tempest is all about. It's validating that the integration between Nova and Neutron is working as expected.
>
> So, I actually think the assertion on the floating IP address appearing (after some timeout/timeout-backoff) is entirely appropriate.
>
>> In general, the validity of checking for the presence of a floating IP in the server details is a matter of interpretation. I think it is a given that it must be tested somewhere, and that if it causes a test to fail then it is as valid a failure as a ping failing. Certainly I have seen scenarios where an IP appears but doesn't actually work, and others where the IP doesn't appear (ever, not just after a really long while) but magically works. Both are bugs. Which is more appropriate to tests like test_network_basic_ops?
>
> I believe both assertions should be part of the test cases. But since the latter condition (good ping connectivity, but no floater ever appears attached to the instance) necessarily depends on the first failure (floating IP does not appear in the server details after a timeout), perhaps one way to handle this would be the following (a rough sketch in code follows this exchange):
>
> a) create server instance
> b) assign floating ip
> c) query server details looking for the floater in a timeout-backoff loop
>    c1) floater does appear
>        c1-a) assert ping connectivity
>    c2) floater does not appear
>        c2-a) check ping connectivity; if it succeeds, use a call to testtools.TestCase.addDetail() to provide some "interesting" feedback
>        c2-b) raise an assertion that the floater did not appear in the server details
>
>> The polling interval for these checks in the gate should be tuned. The checks currently borrow other polling configuration, which I can now see is ill-advised: they poll at an interval of one second, and if the intent is to wait for the entire system to settle down before proceeding, then polling nova that quickly is too often. It simply adds load to an already loaded system while we are waiting for it to catch up. For example, over the course of a three-minute timeout, the floating IP check polled nova for server details 180 times.
>
> Agreed completely.
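Jay's flow, as a rough Python sketch. Every self._* helper here, plus the timeout and backoff values, are hypothetical stand-ins for the real Tempest scenario plumbing; only testtools.TestCase.addDetail() and testtools.content.text_content() are actual testtools API:

    import time

    import testtools
    from testtools import content


    class TestNetworkBasicOps(testtools.TestCase):

        def test_floating_ip_and_connectivity(self):
            # a) create server instance, b) assign floating ip
            # (self._create_server and self._assign_floating_ip are
            # hypothetical helpers, not real Tempest methods)
            server = self._create_server()
            fip = self._assign_floating_ip(server)

            # c) query server details for the floater in a
            #    timeout-backoff loop
            deadline = time.time() + 180
            interval = 1
            appeared = False
            while time.time() < deadline:
                # self._server_addresses is a hypothetical helper that
                # flattens the addresses out of a server-details query
                if fip in self._server_addresses(server):
                    appeared = True
                    break
                time.sleep(interval)
                interval = min(interval * 2, 30)

            if appeared:
                # c1-a) assert ping connectivity
                self.assertTrue(self._check_ping(fip))
            else:
                # c2-a) ping anyway; success here is "interesting"
                if self._check_ping(fip):
                    self.addDetail(
                        'floating-ip',
                        content.text_content(
                            'ping succeeded although the floating IP '
                            'never appeared in the server details'))
                # c2-b) the floater never made it into the details
                self.fail('floating IP not in server details after 180s')

Note that step c2-a still runs the ping, so the "ping works but no floater ever appears" case gets recorded as a detail rather than being silently folded into the timeout failure.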
We should just add an exponential backoff to the waiting; that would decrease the load over time. I'd be +2 to such a patch.

That being said... I'm not sure why 1 request/sec is considered a load that would break the system. That doesn't seem a completely unreasonable load. If you look at the sysstat log in the gate runs where things fail, you will be able to see the load at the point where this doesn't work.

	-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
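A minimal sketch of the exponential backoff Sean suggests, assuming a poll() callable that returns True once the floating IP shows up; the function name and the bounds are illustrative, not an existing Tempest utility. Starting at one second and doubling up to a 30-second cap, a three-minute wait issues roughly ten requests instead of 180:

    import time


    def wait_with_backoff(poll, timeout=180, initial=1.0, cap=30.0):
        # Call poll() until it returns True or the timeout expires.
        # The sleep doubles after each miss (1, 2, 4, ... capped), so a
        # three-minute wait costs roughly ten polls instead of 180.
        deadline = time.time() + timeout
        interval = initial
        while True:
            if poll():
                return True
            remaining = deadline - time.time()
            if remaining <= 0:
                return False
            time.sleep(min(interval, remaining))
            interval = min(interval * 2, cap)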