On 14/10/2013 5:47 PM, Toni Mueller wrote:
> did you investigate disk I/O?
Hi again,

Thanks for your suggestions (more on those below). In the meantime, we have increased CPU power to 4 cores and the server now behaves much better.
I found that server performance was hitting a bottleneck in php-fpm because the microcache was effectively NOT being used: most pages were returning 303 and 502 codes, and those return codes are not included in fastcgi_cache_valid by default. When I set:
fastcgi_cache_valid 200 301 302 303 502 3s;

I saw immediate performance gains, and the Unix load average dropped to almost 0 (from 100 - not a typo) under load.
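For context, here is a minimal sketch of the kind of microcache setup I mean; the cache path, zone name, sizes and the php-fpm socket below are illustrative placeholders, not my exact configuration (that is at the link further down):

http {
    # Illustrative cache storage; path, zone name and sizes are placeholders.
    fastcgi_cache_path /var/cache/nginx/microcache levels=1:2
                       keys_zone=microcache:10m max_size=256m inactive=1m;

    server {
        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_pass unix:/var/run/php-fpm.sock;  # placeholder upstream

            fastcgi_cache microcache;
            fastcgi_cache_key $scheme$request_method$host$request_uri;
            # Also cache the codes that were previously bypassing the cache:
            fastcgi_cache_valid 200 301 302 303 502 3s;
        }
    }
}

With a 3-second validity, even a short burst of identical requests is served from the cache instead of hitting php-fpm.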
I used iostat during a load test and I didn't see any serious stress on I/O. The worst (max load) recorded entry is:
==========================================================================================================
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          85.43    0.00   12.96    0.38    0.00    1.23

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
vda        0.00  136.50   0.00   21.20     0.00  1260.00     59.43      1.15   54.25   3.92   8.30
dm-0       0.00    0.00   0.00  157.50     0.00  1260.00      8.00     13.39   85.04   0.53   8.29
dm-1       0.00    0.00   0.00    0.00     0.00     0.00      0.00      0.00    0.00   0.00   0.00
==========================================================================================================

Can you see a serious problem here? (I am not an expert, but, judging from what I have read on the Internet, these numbers do not look bad.)
Now my problem is that performance seems to hit a ceiling at around 1200 req/sec (which is not too bad, anyway), although CPU and memory remain ample throughout the test. Increasing the load beyond that (I am using tsung for load testing) only results in a growing number of "error_connect_emfile" errors.
The results of one test are attached (100 users arriving per second for 5 minutes, with a maximum of 10000 users, each of them hitting the homepage 100 times; details of the test are at the bottom of this mail).
My research suggests this is a result of file descriptor exhaustion; however, I could not find the root cause. The following all seem OK:
# cat /proc/sys/fs/file-max
592940
# ulimit -n
200000
# ulimit -Hn
200000
# ulimit -Sn
200000
# grep nofile /etc/security/limits.conf
*    -    nofile    200000

Could you please guide me on how to resolve this issue? What is the real bottleneck here, and how can it be overcome?
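The limits above apply to my shell; I am not certain they are what the running nginx workers actually inherit. If the limit turns out to be on the nginx side, I assume raising it explicitly would look roughly like this (values are illustrative, not what I currently have set):

# Assumed sketch: set the per-worker descriptor limit explicitly,
# instead of relying on whatever limit the workers inherit at startup.
worker_rlimit_nofile  200000;

events {
    # Each connection uses at least one descriptor; keep this
    # comfortably below worker_rlimit_nofile.
    worker_connections  10240;
}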
My config remains as initially posted (it can also be seen here: https://www.ruby-forum.com/topic/4417776), with the only difference being "worker_processes 4" (since we now have 4 CPU cores).
Please advise.

============================= tsung.xml <start> =============================
<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="debug" dumptraffic="false" version="1.0">
  <clients>
    <client host="localhost" use_controller_vm="true" maxusers="10000"/>
  </clients>
  <servers>
    <server host="www.example.com" port="80" type="tcp"></server>
  </servers>
  <load duration="5" unit="minute">
    <arrivalphase phase="1" duration="5" unit="minute">
      <users arrivalrate="100" unit="second"/>
    </arrivalphase>
  </load>
  <sessions>
    <session probability="100" name="hit_en_homepage" type="ts_http">
      <for from="1" to="100" var="i">
        <request><http url='/' version='1.1' method='GET'></http></request>
        <thinktime random='true' value='1'/>
      </for>
    </session>
  </sessions>
</tsung>
============================== tsung.xml <end> ===============================
Thanks and regards,
Nick
<<attachment: graphes-Perfs-rate_tn.png>>
<<attachment: graphes-Users_Arrival-rate_tn.png>>
<<attachment: graphes-Users-simultaneous_tn.png>>
<<attachment: graphes-Errors-rate_tn.png>>
<<attachment: graphes-Perfs-mean_tn.png>>