On Wed, 5 Jun 2013, Roberto Giordani wrote:
Hello David,
your last email was very useful to me and I appreciated the
explanation!!! :-)
I can't go into a lot of depth right now, so I'll reply more later
I've changed the client and server configuration after the benchmark test,
as you can see, but I'm still having trouble on the server side.
1) Attached are the two .png graphs of the bandwidth available between
client and server (about 200-300 Mbit/s).
2) The thread on the server that was using 100% CPU is the main queue
(MainQ) thread, so I've increased the thread count to 20 and now all
threads run at about 20%, using all CPUs.
Try doing an strace on that thread for a few seconds and see what it's
spending its time doing.
The fact that all threads together aren't going past 100% suggests that you
may have some common bottleneck.
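For example (the TID here is a placeholder; use whichever thread top shows as the busy one):

  top -H -p $(pidof rsyslogd)    # per-thread view, note the TID of the hot thread
  strace -p <TID> -tt -T         # raw syscall stream with timestamps and per-call times
  strace -p <TID> -c             # or a summary of where the time goes (Ctrl-C to stop)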
3) I've reduced the number of input files for each client, selecting only
the most used ones, and reduced the corresponding actions on the server;
attached is the rsyslog-pstats output of server fsp01.
4) This is the client conf:
$MainMsgQueueType LinkedList
$MainMsgQueueSize 1000000
#$MainMsgQueueWorkerThreads 2
$MainMsgQueueDequeueBatchSize 5000
$MainMsgQueueSaveOnShutdown on
$ActionResumeRetryCount -1 # infinite retries if host is down
$ActionQueueTimeoutEnqueue 1 # after it fills up, the main queue also fills up
$ActionSendResendLastMsgOnReconnect on # the last message is to be resent when a connection breaks and has been reconnected.
$ActionQueueCheckpointInterval 1 #
local6.debug @@fsp01:10516
5) This is the server conf:
# Provides TCP syslog reception
$ModLoad imtcp
$InputTCPServerRun 10514
$InputTCPServerRun 10515
$InputTCPServerRun 10516
#### GLOBAL DIRECTIVES ####
# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
$ActionFileEnableSync off
$umask 0007
$MainMsgQueueType LinkedList
$MainMsgQueueSize 1000000
$MainMsgQueueWorkerThreads 20
$MainMsgQueueDequeueBatchSize 5000
$MainMsgQueueSaveOnShutdown on
And each config file included in the server conf has 5 actions.
the actions could be the bottleneck.
When I tested, FixedArray (the default) was faster than LinkedList. Yes,
it may use more memory, but on a dedicated server this is reasonable
(and you want to tune the max size to avoid paging in any case, and a
fixed allocation is easier to check)
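As a rough sketch of what I mean on the server (the size is only an example,
tune it so the full queue still fits comfortably in RAM):

  $MainMsgQueueType FixedArray   # the default; preallocated slots, faster than LinkedList in my tests
  $MainMsgQueueSize 500000       # example value, size it so it cannot push the box into swap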
My questions today are:
a) Each client has an InputFileStateFile, but my application rotates its
own files at 00:00. After that rotation, does rsyslog pick up the new,
empty input file?
I'm not sure what you are referring to, but I believe that rsyslog will
find the new file and start from the beginning of it.
b) If for some reason I need to start the rsyslog client on a new server
that already has an input file with a size of 300 MB, is there a way to
start logging only the new lines inserted in that file? I think the
answer is no :-)
I think that you are correct
c) You said that each client uses a thread per input file, but I see
that it starts sending one file and doesn't interleave messages
from different input files.
How can I speed up the sending process? The server has 20 threads to
receive messages.
I think that what happens is that rsyslog starts on a file and does all
the work for that one file before moving on to the next. After it's up to
date on each file it will cycle through them looking for new data.
d) Because of the LinkedList mode for the main queue and the action queue
I don't want to lose any messages, but if I set InputFilePersistStateInterval
to 10 I get a lot of I/O, so I have set it to 10000; is that correct?
If you are talking about data on the order of 100,000 logs/sec, set the
interval to at least 10,000 if not 100,000. You don't want to write this
state out frequently; every second or every few seconds should be good
enough. The reason for this file is so that if rsyslog crashes it will
know roughly where it left off in the file. Under conditions like that,
resending a few seconds' worth of a file seems reasonable to me.
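As a rough worked example (the per-file rate is an assumption, not a measurement):
if one input file grows by about 10,000 lines per second, then

  $InputFilePersistStateInterval 10000    # state file rewritten roughly once per second for that file
  $InputFilePersistStateInterval 100000   # roughly once every ten seconds

either of which is a tiny amount of I/O compared to writing it every 10 lines.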
How can I prevent message loss without using a disk queue?
e) How can I do the "Disabling the disk fallback..." that you mentioned?
Well, if you are getting all your data from files, and you just disable
throwing away messages and retry forever, then the reading of the file will
block if rsyslog can't deliver the messages, so the files act as your
fallback. There is no need to read them into rsyslog only to have rsyslog
write them back out to disk again (assuming that you fix the problem before
you rotate the files twice, at least).
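A minimal sketch of that idea on the client side (the timeout value is only an
example, and I have not tested this exact combination):

  $MainMsgQueueType LinkedList
  $MainMsgQueueSize 1000000
  $MainMsgQueueTimeoutEnqueue 60000   # example: wait up to 60 s on a full queue before anything is dropped
  $ActionResumeRetryCount -1          # retry the forward forever instead of suspending it
  # no $MainMsgQueueFileName / $ActionQueueFileName anywhere, so nothing is spooled to disk;
  # when the queue is full, imfile simply stops advancing through the source files
  local6.debug @@fsp01:10516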
Just a note that writing to a file and then reading it back is pretty
inefficient; if you can arrange direct delivery to rsyslog you will
probably like it better.
David Lang
Thanks a lot for your support.
Regards,
Roberto.
On 06/04/2013 05:53 PM, David Lang wrote:
On Tue, 4 Jun 2013, Roberto Giordani wrote:
Hello David,
according to your suggestion, these are my changes:
1) I've removed ActionQueue on client
2)added on server
$MainMsgQueueType LinkedList
$MainMsgQueueSize 100000
$MainMsgQueueWorkerThreads 3
3) The DNS lookup was already disabled with the -x option when starting
rsyslog
4) I can't use UDP because my next step will be to encrypt the messages.
And this is the pstats report after 4 hours on the server:
2013-06-04T13:09:26.398752+02:00 fsp01 rsyslogd-pstats: imtcp(10514): submitted=4583409
2013-06-04T13:09:26.773487+02:00 fsp01 rsyslogd-pstats: main Q: size=9969 enqueued=5082093 full=158414 discarded.full=0 discarded.nf=0 maxqsize=10000
Now the server uses 3 of the 4 available CPUs.
I've tested the client locally before forwarding to the server and I don't
observe any issue.
I would like to hear your thoughts on these parameters:
a) $MainMsgQueueDequeueBatchSize
b) $InputFilePersistStateInterval
-) Which value is correct for the previous parameters when using
$MainMsgQueueSize 1000000?
you should not need to change these values from default
I didn't receive an answer about my first question
-) How is it possible to increase the rate of incoming messages and handle them quickly?
lots of things could be the problem, but we need to find out where
the bottleneck is in your system
-) How can I retrieve the number of messages sent and the number of
messages received?
The best way to do this is to set up a benchmark test where you know
the number of messages that are sent and can measure the number
received.
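A crude sketch of such a test (the count and the marker string are made up;
any generator works just as well), using the file names from your config:

  # on the client: append exactly 1,000,000 marked lines to the monitored file
  seq 1 1000000 | sed 's/^/BENCHMARK line /' >> /file1.log
  # on the server, once the queues have drained: count what arrived
  grep -c 'BENCHMARK line' /rsyslog-data/file1.log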
-) Do you think using more than one listener on the server is better?
only if the listener is the bottleneck and you have multiple senders.
If the server is now using 3 out of 4 CPUs, then it's possible that
one of the CPUs is maxing out and that is limiting your overall
throughput. With a simple "receive and write to disk" configuration,
rsyslog probably only uses a few threads, so it may not need the 4th
cpu.
When you look at top with threads enabled (after hitting 'H') on the
server, is one of the syslog threads using 100% CPU (or close to it)?
Some of the places you could be having bottlenecks:
1. The receiving thread could be the bottleneck, in which case using
imptcp could be a win (see the sketch after this list).
2. Activity on the main queue could be the bottleneck, in which case
allowing larger batch sizes could help. Disabling the disk fallback
could help, and using FixedArray instead of LinkedList could help.
3. Output to the disk could be the bottleneck; what filesystem you are
using, how large the files get before you rotate them, and what fsync
activity is taking place on the filesystem can all cause problems.
Dynamic file names do have some overhead.
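For point 1, a minimal sketch of what switching the listener to imptcp would
look like (you already built with --enable-imptcp; untested on my side):

  $ModLoad imptcp
  $InputPTCPServerRun 10514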
Set up a test file that has a few hundred million log messages in it.
Then see how long it takes to send that file. If you have enough RAM
on the central box, set your max queue size very large (1M or so) and
repeat the test, looking specifically to see if your sending box
finishes noticeably before the receiving box (if so, the bottleneck
is in the output, not the input).
While I understand that you are looking to add encryption to your
connection in the near future, do a test with UDP instead of TCP just
to see if that is the bottleneck.
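For that test the change on the client is a single character in the
forwarding line (one @ means UDP, two mean TCP):

  local6.debug @10.10.1.10:10514    # UDP, for the test only
  #local6.debug @@10.10.1.10:10514  # TCP, as you have it now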
Sometimes I find it useful to do a strace on threads that are using a
lot of cpu as the system calls that are being made let me know what
that particular thread is working on.
David Lang
Regards,
Roberto.
On 06/04/2013 08:32 AM, David Lang wrote:
On Mon, 3 Jun 2013, Roberto Giordani wrote:
Hello,
I'm working on a project where 20 RHEL 5.8 servers (with rsyslog
5.8.12) each have 20 input files and send about 10,000 messages per
second to one rsyslog 7.2 server.
The network between client and server is gigabit, and these are the
daily nmon network traffic graphs.
The first configuration used a file-based (disk) queue, but it was too
slow, so I've switched to the LinkedList queue.
in my testing memory was the fastest (around 1000x as fast as a full
disk queue)
The current client settings are
$InputFileName /file1.log
$InputFileTag TEST1
$InputFileStateFile file1
$InputFileSeverity debug
$InputFileFacility local6
$InputRunFileMonitor
$InputFilePersistStateInterval 10
....
....
....
############################################
$MaxMessageSize 9000
$MainMsgQueueType LinkedList
$MainMsgQueueSize 1000000
$MainMsgQueueWorkerThreads 20
$MainMsgQueueDequeueBatchSize 5000
$MainMsgQueueSaveOnShutdown on
$ActionQueueType LinkedList
$ActionQueueSize 2000000
$ActionQueueWorkerThreads 50
$ActionQueueDequeueBatchSize 5000
$ActionQueueSaveOnShutdown on
###########################################
$ActionResumeRetryCount -1
$ActionQueueTimeoutEnqueue 1
$ActionSendResendLastMsgOnReconnect on
$ActionQueueCheckpointInterval 1
local6.debug @@10.10.1.10:10514
The server settings are
##############################################
# Provides TCP syslog reception
$ModLoad imtcp
$InputTCPServerRun 10514
#### GLOBAL DIRECTIVES ####
# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
$ActionFileEnableSync off
$umask 0007
if $fromhost-ip == "10.10.1.7" and $syslogfacility-text == "local6"
and $syslogseverity-text == "debug" and $syslogtag == "TEST1" then
action(type="omfile" DirOwner="user1" DirCreateMode="0750"
FileCreateMode="0444" File="/rsyslog-data/file1.log")
if $fromhost-ip == ........
###############################
My questions are:
1) How do I find the right combination of
main queue ---> action queue ---> receiver queue on the rsyslog server?
Why do you need an Action Queue? It's only needed if you have some
reason to think that that particular action is going to fall behind and
you want to decouple it from the rest of the outputs.
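As a sketch, without an action queue the forwarding part of the client config
would shrink to something like this (all directives and values taken from your
own config, just with the $ActionQueue* lines dropped):

  $MainMsgQueueType LinkedList
  $MainMsgQueueSize 1000000
  $MainMsgQueueDequeueBatchSize 5000
  $MainMsgQueueSaveOnShutdown on
  $ActionResumeRetryCount -1
  $ActionSendResendLastMsgOnReconnect on
  local6.debug @@10.10.1.10:10514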
2) Is it possible to increase the incoming message rate and handle it quickly?
Yes, I've processed 380K logs/sec (gig-e limit) and others with 10G
networks have processed 1M logs/sec
3) Why are the output logs on the centralized rsyslog server still behind
the source logs on the client?
Lots of possible reasons, but first off, what are you looking at that
tells you the central servers are behind?
Start off by seeing if you are having problems reading the data from the
file. Set up a test where you just read from the file and write back out
to a file (I'd suggest starting without any queue directives at all).
Then, if that's fast enough, change it to send remotely and report what's
happening.
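A minimal local-only test config for one input might look like this (the state
file name and the output path are made-up examples; no queue directives at all):

  $ModLoad imfile
  $InputFileName /file1.log
  $InputFileTag TEST1
  $InputFileStateFile file1-localtest
  $InputFileSeverity debug
  $InputFileFacility local6
  $InputRunFileMonitor
  # write straight back out to a local file instead of forwarding
  local6.debug /var/log/imfile-localtest.log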
I've activated the pstats module to understand the queue status on the
clients and the server.
After 6 hours of running, the client has this report:
.........
2013-06-03T22:23:16.708288+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9327675 full=6831507 maxqsize=2000000
2013-06-03T22:23:16.708297+02:00 app01 rsyslogd-pstats: main Q: size=2 enqueued=9648448 full=0 maxqsize=20395
2013-06-03T22:23:46.708367+02:00 app01 rsyslogd-pstats: imuxsock: submitted=323414 ratelimit.discarded=0 ratelimit.numratelimiters=29974
2013-06-03T22:23:46.708382+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9340578 full=6844410 maxqsize=2000000
2013-06-03T22:23:46.708390+02:00 app01 rsyslogd-pstats: main Q: size=2164 enqueued=9666464 full=0 maxqsize=20395
2013-06-03T22:24:16.708923+02:00 app01 rsyslogd-pstats: imuxsock: submitted=328198 ratelimit.discarded=0 ratelimit.numratelimiters=29986
2013-06-03T22:24:16.708941+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9355649 full=6859481 maxqsize=2000000
2013-06-03T22:24:16.708949+02:00 app01 rsyslogd-pstats: main Q: size=1364 enqueued=9686593 full=0 maxqsize=20395
2013-06-03T22:24:46.709300+02:00 app01 rsyslogd-pstats: imuxsock: submitted=333070 ratelimit.discarded=0 ratelimit.numratelimiters=29997
2013-06-03T22:24:46.709316+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9365276 full=6869108 maxqsize=2000000
2013-06-03T22:24:46.709323+02:00 app01 rsyslogd-pstats: main Q: size=2123 enqueued=9702047 full=0 maxqsize=20395
2013-06-03T22:25:16.709807+02:00 app01 rsyslogd-pstats: imuxsock: submitted=337951 ratelimit.discarded=0 ratelimit.numratelimiters=30009
2013-06-03T22:25:16.709823+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9379492 full=6883324 maxqsize=2000000
2013-06-03T22:25:16.709832+02:00 app01 rsyslogd-pstats: main Q: size=2 enqueued=9719723 full=0 maxqsize=20395
2013-06-03T22:25:46.709942+02:00 app01 rsyslogd-pstats: imuxsock: submitted=343014 ratelimit.discarded=0 ratelimit.numratelimiters=30021
2013-06-03T22:25:46.709980+02:00 app01 rsyslogd-pstats: action 9 queue: size=2000000 enqueued=9389640 full=6893472 maxqsize=2000000
And the rsyslog server has the following stats:
.......
2013-06-03T23:05:15.898682+02:00 fsp01 rsyslogd-pstats: main Q: size=9978 enqueued=5032165 full=156941 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:06:56.157199+02:00 fsp01 rsyslogd-pstats: main Q: size=9987 enqueued=5065134 full=157971 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:08:30.657673+02:00 fsp01 rsyslogd-pstats: main Q: size=9972 enqueued=5096315 full=158942 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:10:09.895850+02:00 fsp01 rsyslogd-pstats: main Q: size=9986 enqueued=5129162 full=159969 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:11:42.488505+02:00 fsp01 rsyslogd-pstats: main Q: size=9973 enqueued=5159935 full=160933 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:13:23.213800+02:00 fsp01 rsyslogd-pstats: main Q: size=9973 enqueued=5193246 full=161973 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:14:58.833570+02:00 fsp01 rsyslogd-pstats: main Q: size=9970 enqueued=5224922 full=162962 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:16:35.184133+02:00 fsp01 rsyslogd-pstats: main Q: size=9975 enqueued=5256863 full=163960 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:18:13.992958+02:00 fsp01 rsyslogd-pstats: main Q: size=9991 enqueued=5289392 full=164977 discarded.full=0 discarded.nf=0 maxqsize=10000
2013-06-03T23:19:52.464473+02:00 fsp01 rsyslogd-pstats: main Q: size=9942 enqueued=5322013 full=165996 discarded.full=0 discarded.nf=0 maxqsize=10000
4) Why does the enqueued value on the client never decrease, and why is
full= always different from 0?
Well, once you have enqueued something, you don't go back in time and
un-enqueue it, so unless you restart rsyslog, that number is just going
to keep climbing.
I'm planning to distribute the 20 clients across 2 rsyslog server
processes listening on different ports on the same server, but I need
your help on the right combination of action queue on the client and
main queue on the server.
In the rsyslog documentation I've found a lot of info about handling the
input queue (number of threads, max messages and DequeueBatchSize), but I
don't know how rsyslog should best be tuned on the receiving server.
I've seen that the rsyslog 7.2 process on the server uses only one CPU
at a time, while the server has 4 CPUs and 12 GB of RAM on a 64-bit O.S.
5) Can some option during the configure process change this behavior?
$ ./configure --prefix=/usr/local/rsyslog7 --enable-gnutls
--enable-imfile --enable-omruleset --enable-imptcp
--enable-mmnormalize --enable-usertools --enable-imdiag
--enable-diagtools --enable-impstats
I think you are trying too hard.
rsyslog will use one thread for each input and one thread for each
output by default (and possibly some others along the way if it sees a
need). The fact that you aren't seeing it use more than 100% CPU
indicates that you probably have some other bottleneck.
On the receiving side, have you disabled DNS lookups? That has a big
impact in many cases.
When you look at top on the central box, hit 'H' to have it display the
different threads as separate lines; then you will see how much CPU is
used by each one. After that you can track down which one is the
bottleneck.
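For example:

  top -H -p $(pidof rsyslogd)              # live per-thread CPU view
  ps -eLo tid,pcpu,comm | grep rsyslogd    # one-shot per-thread snapshot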
David Lang