Re: [pmacct-discussion] nfacct total bytes inconsistencies

Vaggelis Koutroumpas Tue, 01 Dec 2015 10:27:47 -0800

Hello Paolo,

I guess I was wrong about the numbers not being off too much. I had to
wait for more data to be collected. As time passes the total bytes
accounted are getting way off.


What would be the maximum accepted discrepancy in an ideal setup?
I know that NetFlow totals will always differ somewhat from SNMP
measurements, but how much difference is considered normal? (I know it's
kind of a vague question.)
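
To make "how far off" concrete, here is a minimal sketch of the comparison I have in mind. The function name is my own, and the SNMP figure in the example is a placeholder (nfacctd total plus 6 GB), not an exact Observium reading.

```python
def relative_discrepancy(netflow_total, snmp_total):
    """Return the NetFlow total's deviation from the SNMP total, in percent.

    Negative means NetFlow accounted for less than SNMP reported.
    Both arguments must use the same unit (bytes, GB, ...).
    """
    if snmp_total <= 0:
        raise ValueError("snmp_total must be positive")
    return (netflow_total - snmp_total) / snmp_total * 100.0


if __name__ == "__main__":
    nfacctd_gb = 101.03            # total_out from the hourly table
    snmp_gb = nfacctd_gb + 6.0     # hypothetical: "over 6 GB off"
    print(f"{relative_discrepancy(nfacctd_gb, snmp_gb):.2f}%")
```

With those placeholder inputs the result is a deficit of roughly 5.6%.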

Restarting nfacctd did not change anything.

I also restarted the whole box just in case.
The UDP drop counters remain unchanged:

Udp:
    78879 packets received
    590 packets to unknown port received.
    0 packet receive errors
    25 packets sent

  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  101: 00000000:A1F1 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10613 2 ffff88013a354780 0
  575: 0100007F:2BCB 00000000:0000 07 00000000:00000000 00:00000000 00000000   110        0 10201 2 ffff8800bacfdac0 0
 1720: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10630 2 ffff88013a354400 0
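
Since the ports in /proc/net/udp are hexadecimal, a small script makes the per-socket drop counters easier to read. This is only a sketch: the function name is my own, and it assumes the usual column layout where the local address is the second field and drops is the last one.

```python
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  101: 00000000:A1F1 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10613 2 ffff88013a354780 0
  575: 0100007F:2BCB 00000000:0000 07 00000000:00000000 00:00000000 00000000   110        0 10201 2 ffff8800bacfdac0 0
 1720: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10630 2 ffff88013a354400 0
"""


def parse_proc_net_udp(text):
    """Return a list of (local_port, rx_queue_bytes, drops) per socket row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Socket rows start with "<slot>:" and carry 13 columns;
        # the header row and blank lines are skipped.
        if len(fields) < 13 or not fields[0].endswith(":"):
            continue
        local_port = int(fields[1].split(":")[1], 16)   # hex port
        rx_queue = int(fields[4].split(":")[1], 16)     # tx_queue:rx_queue
        drops = int(fields[12])                         # decimal counter
        rows.append((local_port, rx_queue, drops))
    return rows


if __name__ == "__main__":
    for port, rx_queue, drops in parse_proc_net_udp(SAMPLE):
        print(f"port={port} rx_queue={rx_queue} drops={drops}")
```

On the live box the text would come from open("/proc/net/udp").read() instead of SAMPLE.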


Regarding the VLAN traffic, I do have VLAN traffic, but MikroTik does
not export this field, as far as I can tell from the NetFlow template.

DEBUG ( default/core ): NfV9 agent         : X.X.X.X:0
DEBUG ( default/core ): NfV9 template type : flow
DEBUG ( default/core ): NfV9 template ID   : 257
DEBUG ( default/core ): -----------------------------------------------------
DEBUG ( default/core ): |    pen     |     field type     | offset |  size  |
DEBUG ( default/core ): | 0          | ip version         |      0 |      1 |
DEBUG ( default/core ): | 0          | IPv6 src addr      |      1 |     16 |
DEBUG ( default/core ): | 0          | IPv6 src mask      |     17 |      1 |
DEBUG ( default/core ): | 0          | input snmp         |     18 |      4 |
DEBUG ( default/core ): | 0          | IPv6 dst addr      |     22 |     16 |
DEBUG ( default/core ): | 0          | IPv6 dst mask      |     38 |      1 |
DEBUG ( default/core ): | 0          | output snmp        |     39 |      4 |
DEBUG ( default/core ): | 0          | IPv6 next hop      |     43 |     16 |
DEBUG ( default/core ): | 0          | L4 protocol        |     59 |      1 |
DEBUG ( default/core ): | 0          | tcp flags          |     60 |      1 |
DEBUG ( default/core ): | 0          | tos                |     61 |      1 |
DEBUG ( default/core ): | 0          | L4 src port        |     62 |      2 |
DEBUG ( default/core ): | 0          | L4 dst port        |     64 |      2 |
DEBUG ( default/core ): | 0          | 31                 |     66 |      4 |
DEBUG ( default/core ): | 0          | 64                 |     70 |      4 |
DEBUG ( default/core ): | 0          | last switched      |     74 |      4 |
DEBUG ( default/core ): | 0          | first switched     |     78 |      4 |
DEBUG ( default/core ): | 0          | in bytes           |     82 |      4 |
DEBUG ( default/core ): | 0          | in packets         |     86 |      4 |
DEBUG ( default/core ): | 0          | in dst mac         |     90 |      6 |
DEBUG ( default/core ): | 0          | out src mac        |     96 |      6 |
DEBUG ( default/core ): -----------------------------------------------------
DEBUG ( default/core ): Netflow V9/IPFIX record size : 102
DEBUG ( default/core ):


What drives me crazy is that if I do controlled data transfers for long
periods of time, nfacctd counts everything properly. I can see the rate
at which the bytes counter increases in the database, and working it out
gives exactly the Mbit/s I am transferring at.
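
For reference, this is roughly the calculation I do by hand, written as a minimal sketch. The function name and the sample numbers are made up for illustration; the two byte readings would come from the database counter at two points in time.

```python
from datetime import datetime


def mbit_per_s(bytes_start, bytes_end, t_start, t_end):
    """Average rate in Mbit/s between two readings of a byte counter."""
    seconds = (t_end - t_start).total_seconds()
    if seconds <= 0:
        raise ValueError("t_end must be later than t_start")
    # bytes -> bits, then per second, then decimal megabits
    return (bytes_end - bytes_start) * 8 / seconds / 1_000_000


if __name__ == "__main__":
    # Hypothetical readings taken one minute apart.
    t0 = datetime(2015, 12, 1, 10, 0, 0)
    t1 = datetime(2015, 12, 1, 10, 1, 0)
    print(mbit_per_s(0, 750_000_000, t0, t1))
```

This uses decimal megabits (1 Mbit = 1,000,000 bits), which is what interface rates are normally quoted in.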

So it seems that RouterOS does export the flows properly and nfacctd
does measure the bytes properly.
And yet, when I check the results for another IP (which carries normal web
traffic), the numbers are always off and get worse as time goes by.
What's even stranger is that the download bytes (which are always the
smaller figure in reality) are measured slightly higher by nfacctd (by a
few MB to a few hundred MB), while the upload bytes are measured lower
than what is actually going over the wire (by 1GB to 3-4GB per hour,
depending on how much traffic the server has at any given hour).


Unfortunately the collector box is not accessible from the internet. I
understand that this would help you identify the issue much quicker than
explaining to me every possible solution to try.
I'll try to get permission to allow you access (via VPN or something) if
nothing else works.
I really do appreciate the offer to help! :)


Today I noticed some sporadic INFO messages in the nfacctd output.

INFO: expecting flow '657566' but received '657621' collector=0.0.0.0:2055 agent=X.X.X.X1:0
INFO: expecting flow '657677' but received '657738' collector=0.0.0.0:2055 agent=X.X.X.X:0


Is this normal? Does that mean that it lost a flow somewhere and that's
why it throws this INFO message?
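
To put a number on these messages, here is a small sketch that pulls the expected and received sequence values out of the log lines and reports the size of each gap. The function name is my own, and what one unit of the sequence number means (a flow record or an export packet) depends on the export protocol version, so I treat the result only as "sequence units skipped".

```python
import re

# Matches the nfacctd message format quoted above.
LINE_RE = re.compile(r"expecting flow '(\d+)' but received '(\d+)'")


def sequence_gaps(lines):
    """Return the list of (received - expected) gaps found in the log lines."""
    gaps = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            expected = int(match.group(1))
            received = int(match.group(2))
            gaps.append(received - expected)
    return gaps


if __name__ == "__main__":
    log = [
        "INFO: expecting flow '657566' but received '657621'",
        "INFO: expecting flow '657677' but received '657738'",
    ]
    gaps = sequence_gaps(log)
    print(gaps, "total skipped:", sum(gaps))
```

For the two messages above the gaps are 55 and 61 sequence units.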

I have increased the buffers:

plugin_pipe_size:   268435456
plugin_buffer_size: 268435
nfacctd_pipe_size:  268435456

root@netflow:~# cat /proc/sys/net/core/rmem_max
268435456
root@netflow:~# cat /proc/sys/net/core/rmem_default
268435456

I am not sure what would be proper values for these settings.
Are those too small? Too big?

I still get those INFO messages sporadically even with the increased
buffers.
It seems to me that the buffers are already far larger than needed (for
only 50-150pps of netflow traffic), so buffer size should not be what
causes the missing data.
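
A back-of-the-envelope check of why I think the buffers are more than big enough. This is a rough sketch under assumptions: I take 1500 bytes per NetFlow datagram as a worst case and ignore kernel per-packet overhead, so the real headroom is somewhat lower than what this prints.

```python
def seconds_of_headroom(buffer_bytes, packets_per_s, bytes_per_packet):
    """How long the buffer could absorb traffic if nothing read from it."""
    if packets_per_s <= 0 or bytes_per_packet <= 0:
        raise ValueError("rates and sizes must be positive")
    return buffer_bytes / (packets_per_s * bytes_per_packet)


if __name__ == "__main__":
    rmem = 268435456          # value of rmem_max / nfacctd_pipe_size
    assumed_packet_size = 1500  # assumption: worst-case datagram size
    for pps in (50, 150):
        secs = seconds_of_headroom(rmem, pps, assumed_packet_size)
        print(f"{pps} pps -> about {secs:.0f} s of buffering")
```

Even at 150pps that works out to many minutes of buffering, which is why I doubt the buffer sizes are the cause.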

I also checked the UDP drop counters while I got those messages and they
didn't increase.


Thanks again for your help :)

On 29/11/2015 6:42 PM, Paolo Lucente wrote:
> Hi Vaggelis,
> 
> In your previous email it seems that for some period of time numbers
> were not 'that off'; is it a behaviour that you manage to reproduce if
> you stop/start nfacctd? I essentially wonder if it was a coincidence
> or there is effectively some degradation.
> 
> Also, i see that you use 'aggregate_filter' in order to split inbound
> from outbound traffic; do you have any VLAN-tagged traffic? If yes,
> then that traffic would not be captured by your current filters. If
> not sure you can check this with a simple pmacct configuration like:
> 
> plugin: memory[test]
> aggregate_filter[test]: vlan
> aggregate[test]: src_host, dst_host, src_port, dst_port, proto, vlan
> 
> When doing a 'pmacct -s' to query the memory table, you should see no
> results (since the NetFlow v9 templates are sent every 60 secs based
> on your RouterOS configuration, please wait some 120 secs before
> drawing conclusions).
> 
> Should all of this still not bring anything conclusive, is remote
> access to your collector box a possibility? If yes, we can follow up
> privately: I'd be more than happy to have a look myself.
> 
> Cheers,
> Paolo
> 
> On Sun, Nov 29, 2015 at 01:22:34AM +0200, Vaggelis Koutroumpas wrote:
>> It seems that the new server shows the same behavior after all :(
>>
>>
>> mysql> SELECT (
>>     ->     SELECT concat(truncate((sum(bytes)/1024/1024/1024),2), 'GB') as bytes
>>     ->     FROM hourly
>>     ->     WHERE ip_dst = '0.0.0.0'
>>     ->       AND stamp_inserted BETWEEN '2015-11-28 20:00:00' AND '2015-11-28 23:59:59'
>>     -> ) as total_out, (
>>     ->     SELECT concat(truncate((sum(bytes)/1024/1024/1024),2), 'GB') as bytes
>>     ->     FROM hourly
>>     ->     WHERE ip_src = '0.0.0.0'
>>     ->       AND stamp_inserted BETWEEN '2015-11-28 20:00:00' AND '2015-11-28 23:59:59'
>>     -> ) as total_in;
>> +-----------+----------+
>> | total_out | total_in |
>> +-----------+----------+
>> | 101.03GB  | 15.43GB  |
>> +-----------+----------+
>> 1 row in set (0.05 sec)
>>
>> While at the same time-frame observium reports higher 'total out' and
>> less 'total in' http://prntscr.com/983ers
>>
>> I guess the 'total in' discrepancy is acceptable. But the 'total out' is
>> over 6Gbytes off!
>>
>> If I increase the time-frame then the totals are more off.
>>
>> mysql> SELECT (
>>     ->     SELECT concat(truncate((sum(bytes)/1024/1024/1024),2), 'GB') as bytes
>>     ->     FROM hourly
>>     ->     WHERE ip_dst = '0.0.0.0'
>>     ->       AND stamp_inserted BETWEEN '2015-11-28 19:00:00' AND '2015-11-28 23:59:59'
>>     -> ) as total_out, (
>>     ->     SELECT concat(truncate((sum(bytes)/1024/1024/1024),2), 'GB') as bytes
>>     ->     FROM hourly
>>     ->     WHERE ip_src = '0.0.0.0'
>>     ->       AND stamp_inserted BETWEEN '2015-11-28 19:00:00' AND '2015-11-28 23:59:59'
>>     -> ) as total_in;
>> +-----------+----------+
>> | total_out | total_in |
>> +-----------+----------+
>> | 129.60GB  | 19.46GB  |
>> +-----------+----------+
>> 1 row in set (0.02 sec)
>>
>> Observium: http://prntscr.com/983nxa
>>
>> Here the 'total out' is 8GBytes off.
>> While 'total in' seems to be a little off but in acceptable range.
>>
>>
>> There are no drops AFAICT.
>>
>> root@netflow:~# netstat -s | grep Udp\: -A 5
>> Udp:
>>     817211 packets received
>>     688 packets to unknown port received.
>>     122 packet receive errors
>>     14971 packets sent
>>     RcvbufErrors: 122
>>
>> Those 122 errors are there for hours (before 20:00:00 of my query).
>>
>> root@netflow:~# cat /proc/net/udp
>>   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
>>   696: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10611 2 ffff88007b36c780 0
>>   751: 00000000:307B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 10580 2 ffff88007b36cb00 0
>>
>>
>> I've also installed munin to monitor the performance of the server.
>> MySQL does on average 40 queries/s.
>> The server load is steadily 0.1
>> The avg incoming packets are ~40pps
>>
>> So the server is far too idle to be losing any data.
>>
>> Any ideas what else to check?
>> What would be an acceptable 'off percentage' of the bytes in comparison
>> with SNMP measurements?
>>
>>
>> Thanks.

_______________________________________________
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
