Hello everybody, For first i would like to thank Paolo for very nice piece of software :)
I've got some strange problem, nfaccounting daemon is for some reason locking my database, causing high load-average ( 1.2 and up for last 5,10,15 minutes ). It's stops when i kill all the nfacctd daemons ( by killall -9 ). I think it's related to cache purge ( I came to this conclusion after greping & reading source code ). biuro:~# nfacctd -h 2>&1 | head -1 NetFlow Accounting Daemon, nfacctd 0.9.1 ( package built from Debian unstable tree ) Earlier i had the same problem with nfacctd version from Debina stable ( pmacct 0.7.x ). The ChangeLog doesn't say that in newer version was something fixed, so i didn't upgrade ( but I will try to upgrade tomorrow to 0.9.4 or even 0.9.5 ). I'm collecting traffic for almost ~2000 home-users ( the goal is to build stats for >4000 homeusers ), the database backend is Postgres 7.4 ( from Debian stable ). The routers are Linux-based with fprobe running on them. The nfacctd daemon can run up to 1.5 day normal ( very low CPU used by the database ) before it runs "LOCK TABLE", then the DB starts to take CPU. Earlier i've also tested nfacctd with MySQL 4.1 ( MyISAM and InnoDB later ), but switched to Postgres - results are the same. The machine is 1 x Xeon 3.6GHz with HT-enabled and 1GB RAM ( i'm going to upgrade it to 2 x Xeon 3.6GHz + 2GB RAM in max 2 weeks, the machine is going to do several other things some day ). Postgres has fsync option disabled due to performacne reasons. Snapshot of "ps uaxw" for "bad" situation: biuro:~# ps uaxww | grep nfacct root 14252 0.0 1.0 16620 10560 ? Ss Nov19 0:35 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 31300 0.0 1.4 20484 14792 ? S Nov19 0:14 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 17203 0.0 1.3 20092 14328 ? S 17:45 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 23615 0.0 1.4 20348 14576 ? S 18:15 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 22347 0.0 1.4 20348 14576 ? S 18:45 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 31530 0.0 1.4 20480 14788 ? S 19:15 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 31782 0.0 1.4 20480 14788 ? S 19:45 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 6019 0.0 1.0 16628 10560 ? Ss 19:59 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 21837 0.0 1.3 19312 13456 ? S 19:59 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 24685 0.0 1.0 16276 10360 ? S 19:59 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf root 3834 0.0 1.2 18652 12952 ? S 20:15 0:00 /usr/sbin/nfacctd -f /etc/pmacct/nfacctd.conf Normally there are only 3-4 nfacctd processes when it works the right way. biuro:~# ps uaxww | grep postgres postgres 14955 0.0 0.3 23516 3968 ? S Nov19 0:00 /usr/lib/postgresql/bin/postmaster -D /var/lib/postgres/data postgres 1516 0.0 0.4 14316 4728 ? S Nov19 0:10 postgres: stats buffer process postgres 23577 0.0 0.3 13456 3852 ? S Nov19 0:07 postgres: stats collector process postgres 19134 11.2 1.3 24332 14344 ? R 17:45 18:55 postgres: pmacct netflow [local] UPDATE postgres 8800 0.0 0.5 24196 5832 ? S 18:15 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting postgres 7435 0.0 0.5 24196 5656 ? S 18:45 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting postgres 20137 0.0 0.5 24196 5692 ? S 19:15 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting postgres 6593 0.0 0.5 24196 5824 ? S 19:45 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting postgres 14043 0.0 0.5 24196 5664 ? S 19:59 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting postgres 27674 0.0 0.5 24196 5840 ? S 20:15 0:00 postgres: pmacct netflow [local] LOCK TABLE waiting pmacct-0.9.1 was self-compiled using: gcc version 3.3.5 (Debian 1:3.3.5-13) glibc 2.3.2.ds1-22 Config file: ------------ daemonize: true debug: false pidfile: /var/run/nfacctd.pid syslog: daemon aggregate[in]: dst_host aggregate[out]: src_host ! FIXME!!! aggregate_filter[in]: dst net 192.168.0.0/16 aggregate_filter[out]: src net 192.168.0.0/16 plugin_pipe_size: 4096000 plugin_buffer_size: 4096 plugins: pgsql[in], pgsql[out] sql_table[in]: acct_in sql_table[out]: acct_out sql_data[in]: typed sql_data[out]: typed sql_data: typed !sql_table_schema[in]: /etc/pmacct/create-table_in.sql !sql_table_schema[out]: /etc/pmacct/create-table_out.sql !sql_multi_values: 512000 sql_db: netflow !sql_host: localhost sql_passwd: password sql_user: pmacct sql_refresh_time: 1800 sql_optimize_clauses: true sql_history: 1h sql_history_roundoff: m sql_recovery_logfile: /var/lib/pmacct/recovery_log sql_table_version: 1 sql_startup_delay: 900 sql_preprocess: qnum=1000, minp=5 imt_buckets: 65537 imt_mem_pools_size: 1024000 nfacctd_port: 2055 nfacctd_ip: a.b.c.d nfacctd_allow_file: /etc/pmacct/allow ---end--- >From the database: netflow=> select * from pg_stat_activity; datid | datname | procpid | usesysid | usename | current_query | query_start -------+---------+---------+----------+---------+---------------+------------- 17144 | netflow | 20137 | 100 | pmacct | | 17144 | netflow | 8800 | 100 | pmacct | | 17144 | netflow | 19134 | 100 | pmacct | | 17144 | netflow | 15412 | 100 | pmacct | | 17144 | netflow | 6593 | 100 | pmacct | | 17144 | netflow | 7435 | 100 | pmacct | | 17144 | netflow | 27674 | 100 | pmacct | | 17144 | netflow | 14043 | 100 | pmacct | | netflow=> select count(*) from acct_out; count ------- 15572 (1 row) netflow=> select count(*) from acct_in; count -------- 111030 (1 row) netflow=> select stamp_inserted from acct_in group by stamp_inserted order by stamp_inserted; stamp_inserted --------------------- 2005-11-19 00:00:00 2005-11-19 01:00:00 2005-11-19 08:00:00 2005-11-19 09:00:00 2005-11-19 10:00:00 2005-11-19 11:00:00 2005-11-19 12:00:00 2005-11-19 13:00:00 2005-11-19 14:00:00 2005-11-19 15:00:00 2005-11-19 16:00:00 2005-11-19 17:00:00 2005-11-19 18:00:00 2005-11-19 19:00:00 2005-11-19 20:00:00 2005-11-19 21:00:00 2005-11-19 22:00:00 2005-11-19 23:00:00 2005-11-20 00:00:00 2005-11-20 01:00:00 2005-11-20 02:00:00 2005-11-20 03:00:00 2005-11-20 04:00:00 2005-11-20 05:00:00 2005-11-20 06:00:00 2005-11-20 07:00:00 2005-11-20 08:00:00 2005-11-20 09:00:00 2005-11-20 10:00:00 2005-11-20 11:00:00 2005-11-20 12:00:00 2005-11-20 13:00:00 2005-11-20 14:00:00 2005-11-20 15:00:00 2005-11-20 16:00:00 2005-11-20 17:00:00 Stracing one (staled?) process (strace -e recv,send -p 17203 -o log -s 1024), shows that: recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\262INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.87.215\', 6, 288)\0", 179, 0) = 179 recv(7, "C\0\0\0\24INSERT 147356 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\316UPDATE acct_in SET packets=packets+8, bytes=bytes+384, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME( 1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.238.113\'\0", 207, 0) = 207 recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\263INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.238.113\', 8, 384)\0", 180, 0) = 180 recv(7, "C\0\0\0\24INSERT 147357 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\315UPDATE acct_in SET packets=packets+9, bytes=bytes+432, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME( 1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.204.19\'\0", 206, 0) = 206 recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\262INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.204.19\', 9, 432)\0", 179, 0) = 179 recv(7, "C\0\0\0\24INSERT 147358 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\316UPDATE acct_in SET packets=packets+6, bytes=bytes+288, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME( 1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.178.240\'\0", 207, 0) = 207 recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\263INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.178.240\', 6, 288)\0", 180, 0) = 180 recv(7, "C\0\0\0\24INSERT 147359 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\317UPDATE acct_in SET packets=packets+12, bytes=bytes+576, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME (1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.242.166\'\0", 208, 0) = 208 recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\264INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.242.166\', 12, 576)\0", 181, 0) = 181 recv(7, "C\0\0\0\24INSERT 147360 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\315UPDATE acct_in SET packets=packets+5, bytes=bytes+240, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME( 1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.85.197\'\0", 206, 0) = 206 recv(7, "C\0\0\0\rUPDATE 0\0Z\0\0\0\5T", 16384, 0) = 20 send(7, "Q\0\0\0\262INSERT INTO acct_in (stamp_updated, stamp_inserted, ip_dst, packets, bytes) VALUES (ABSTIME(1132505101): :Timestamp, ABSTIME(1132502400)::Timestamp, \'192.168.85.197\', 5, 240)\0", 179, 0) = 179 recv(7, "C\0\0\0\24INSERT 147361 1\0Z\0\0\0\5T", 16384, 0) = 27 send(7, "Q\0\0\0\316UPDATE acct_in SET packets=packets+6, bytes=bytes+288, stamp_updated=CURRENT_TIMESTAMP(0) WHERE ABSTIME( 1132502400)::Timestamp::Timestamp without time zone = stamp_inserted AND ip_dst=\'192.168.197.235\'\0", 207, 0) = 207 Notice the "packets=packets+6, bytes=bytes+288", it is duplicated. When I was running MySQL/InnoDB i saw after "show processlist" that when the overload happened i was always having the same query which tried to increment the bytes field always by 288 ( the query was interlaced among others ). Any suggestions or maybe i'm doing something wrong ? -- Jakub Wartak -vnull FreeBSD/OpenBSD/Linux/Solaris/Network Administrator http://vnull.pcnet.com.pl/
