On Sat, Oct 3, 2015 at 9:50 PM, Amir Rohan <amir.ro...@zoho.com> wrote: > On 10/03/2015 02:38 PM, Michael Paquier wrote: >> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote: >>> On 10/02/2015 03:33 PM, Michael Paquier wrote: >>>>> 4) Port assignment relies on liveness checks on running servers. >>>>> If a server is shut down and a new instantiated, the port will get >>>>> reused, data will get trashed, and various confusing things can happen. >>>> >>>> Right. The safest way to do that is to check in get_free_port if a >>>> port number is used by a registered node, and continue to loop in if >>>> that's the case. So done. >>>> >>> >>> That eliminates the "sweet and gentle" variant of the scenario, but it's >>> susceptible to the "ABA problem": >>> https://en.wikipedia.org/wiki/ABA_problem >>> https://youtu.be/CmxkPChOcvw?t=786 >> >> I learnt a new thing here. That's basically an existing problem even >> with the existing perl test infrastructure relying on TestLib.pm when >> tests are run in parallel. What we would need here is a global mapping >> file storing all the port numbers used by all the nodes currently in >> the tests. >> > > Yeah, a poorman's way to ensure ports aren't reused (I wasn't very > clear at top of post ) is something like: > > global nPortsAssigned = 0; > > AssignPort(): > basePort = BASE_PORT; # the lowest port we use > while(!available(basePort+nPortsAssigned)): > basePort++ > > nPortsAssigned++ > > return basePort; > > It has its glaring faults, but would probably work ok. > In any case, I'm sure you can do better.
Yeah, this would improve the exiting port lookup. I don't mind adding a global variable in get_free_port for this purpose. This would accelerate finding a free port in may cases for sure. >>> >>> Granted, you have to try fairly hard to shoot yourself in the leg, >>> but since the solution is so simple, why not? If we never reuse ports >>> within a single test, this goes away. >> >> Well, you can reuse the same port number in a test. Simply teardown >> the existing node and then recreate a new one. I think that port >> number assignment to a node should be transparent to the caller, in >> our case the perl test script holding a scenario. >> > > I was using you *never* want to reuse port numbers. That is, as long > as the lib ensures we never reuse ports within one test, all kinds > of corner cases are eliminated. Hm, sure. Though I don't really why that would be mandatory to enforce this condition as long as the list of ports occupied is in a single place (as long as tests are not run in parallel...). >>>>> 5) Servers are shutdown with -m 'immediate', which can lead to races >>>>> in the script when archiving is turned on. That may be good for some >>>>> tests, but there's no control over it. >>>> >>>> I hesitated with fast here actually. So changed this way. We would >>>> want as wall a teardown command to stop the node with immediate and >>>> unregister the node from the active list. >>>> >>> >>> In particular, I was shutting down an archiving node and the archiving >>> was truncated. I *think* smart doesn't do this. But again, it's really >>> that the test writer can't easily override, not that the default is wrong. >> >> Ah, OK. Then fast is just fine. It shuts down the node correctly. >> "smart" would wait for all the current connections to finish but I am >> wondering if it currently matters here: I don't see yet a clear use >> case yet where it would make sense to have multi-threaded script... If >> somebody comes back with a clear idea here perhaps we could revisit >> that but it does not seem worth it now. >> > > My mistake. Perhaps what would be useful though is a way > to force "messy" shutdown. a kill -9, basically. It's a recovery > test suite, right?. That's what the teardown is aimed at having, the immediate stop mode would play that fairly good enough. There has been a patch from Tom Lane around to stop a server should its postmaster.pid be missing as well... >>>>> Other issues: >>>>> 6. Directory structure, used one directory per thing but more logical >>>>> to place all things related to an instance under a single directory, >>>>> and name them according to role (57333_backup, and so on). >>>> >>>> Er, well. The first version of the patch did so, and then I switched >>>> to an approach closer to what the existing TAP facility is doing. But >>>> well let's simplify things a bit. >>>> >>> >>> I know, I know, but: >>> 1) an instance is a "thing" in your script, so having its associated >>> paraphernalia in one place makes more sense (maybe only to me). >>> 2) That's often how folks (well, how I) arrange things in deployment, >>> at least with archive/backup as symlinks to the nas. >>> >>> Alternatively, naming the dirs with a prefix (srv_foo_HASH, >>> backup_foo_backupname_HASH, etc') would work as well. >> >> The useful portion about tempdir is that it cleans up itself >> automatically should an error happen. It does not seem to me we want >> use that. >> > > Ensuring cleanup and directory structure aren't inherently related. > Testlib makes cleanup easy if you're willing to accept its flat > structure. But writing something that does cleanup and lets yo > control directory structure is perfectly doable. > > The question is only if you agree or not that having per-server > directories could be convenient. Tying into the next, if you > don't think anyone ever need to look into these directories > (which I disagree with), then dir structure indeed doesn't matter. So your point is having one temp dir for the whole, right? I don't disagree with that. >> I would expect hackers to run those runs >> until the end. > > I agree -- when you're running them , but what about when you're > /writing/ them? Well, I enforce CLEANUP=0 manually in TestLib.pm for now. >>>>> 11. a canned "server is responding to queries" helper would be convenient. >>>> >>>> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it? >>>> >>> >>> Block until recovery is finished, before testing. eliminate races, and >>> avoid the stupid sleep(3) I used. >> >> TODO Well. I just recalled this item in the list of things you mentioned. I marked it but forgot to address it. It sounds right that we may want something using pg_isready in a loop as a node in recovery would reject connections. >> >>>>> 4b) server shutdown should perhaps be "smart" by default, or segmented >>>>> into calmly_bring_to_a_close(), pull_electric_plug() and >>>>> drop_down_the_stairs_into_swimming_pool(). >>>> >>>> Nope, not agreeing here. "immediate" is rather violent to stop a node, >>>> hence I have switched it to use "fast" and there is now a >>>> teardown_node routine that uses immediate, that's more aimed at >>>> cleanup up existing things fiercely. >>>> >>> >>> Ok, not as the default, but possible to request a specific kind of >>> shutdown. I needed smart in my case. Plus, in a scenario, you might >>> expressly be testing behavior for a specific mode, it needs to be >>> controllable. >> >> If your test script is running with a single thread, "fast" or "smart" >> would not really make a difference, no? > > It would If there's a bug in one of them and I'm trying to write > a regression test for it. Recall, this was part of broader view > of "provide defaults, allow override" I was suggesting. We could then extend stop_node with an optional argument containing a mode, with fast being the default. Sounds right? >>>> I have as well moved RecoveryTest.pm to src/test/perl so as all the >>>> consumers of prove_check can use it by default, and decoupled >>>> start_node from make_master and make_*_standby so as it is possible to >>>> add for example new parameters to their postgresql.conf and >>>> recovery.conf files before starting them. >>>> >>>> Thanks a lot for the feedback! Attached is an updated patch with all >>>> the things mentioned above done. Are included as well the typo fixes >>>> you sent upthread. >>>> Regards, >>>> >>> >>> Great! I'll look at it and post again if there's more. If any of the >>> above extra explanations make it clearer why I suggested some changes >>> you didn't like... >> >> I am attaching a new version. I found a small bug in test case 001 >> when checking if standby 2 has caught up. There is also this dump >> function that is helpful. The -i switch in cp command has been removed >> as well. >> > > I'm sorry I didn't review the code, but honestly my perl is so rusty I'm > afraid I'll embarrass myself :) I don't pretend mine are good :) So we are two. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers