I have a few more details to share from today’s testing. I attached strace to the process and noticed that thousands of short-lived child processes are being created by the streams application; I am not sure whether RocksDB is playing any role here. More than 73,000 child processes were created, and that is after I increased the default Linux system limit from 32767 to 4 million per PID. I would appreciate it if someone could answer the following questions:

1. Why does RocksDB create this many processes? Each process is trying to read the contents of a file and is getting EAGAIN (Resource temporarily unavailable).
2. I noticed a very high number of .checkpoint files missing, and hence the “No such file or directory” messages.
3. The child processes were created using the “clone” system call. I am not sure whether there is any significance to this, but I wanted to mention it. Moreover, these child processes keep increasing while load is applied continuously, and I am not sure under what condition they will hit the ceiling.
4. I noticed very high VIRT memory usage (around 31G) while RES memory was only 1.9G. Doesn’t every file open/close consume memory? Yet it is not going up even though the number of processes is increasing.
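In case it helps to cross-check these counts, this is the kind of quick probe I have in mind; it just reads /proc, so it is Linux-only, and the class name is purely illustrative (nothing that exists in my code today):

import java.io.File;

// Illustrative Linux-only probe: every task created via clone() appears as an
// entry under /proc/<pid>/task (threads included), and every open descriptor
// as an entry under /proc/<pid>/fd. Comparing the two counts should show
// whether the "child processes" are really separate processes or just threads,
// and how close the process is to the open-files limit.
public class ProcTaskFdCount {
    public static void main(String[] args) {
        // Pass the streams application's PID as the first argument, or omit it
        // to inspect the current JVM ("self"). Reading another process's fd
        // directory requires the same user or root.
        String pid = args.length > 0 ? args[0] : "self";
        String[] fds = new File("/proc/" + pid + "/fd").list();
        String[] tasks = new File("/proc/" + pid + "/task").list();
        System.out.println("open fds: " + (fds == null ? 0 : fds.length));
        System.out.println("threads : " + (tasks == null ? 0 : tasks.length));
    }
}

My current guess (please correct me if I am wrong) is that the clone() calls strace shows are pthread/JVM threads rather than real child processes, and that the RocksDB open-file appetite could, if needed, be capped per store via the rocksdb.config.setter config (a RocksDBConfigSetter implementation calling options.setMaxOpenFiles), but I have not tried that yet.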
Thanks
Thameem

> On Jul 4, 2019, at 9:28 AM, Thameem Ansari <thame...@gmail.com> wrote:
> 
> Sorry for reposting the previous message as the images didn’t come through; pasting as text.
> 
> I have changed both system and user limits.
> 
> To completely isolate the problem, I have tried the application in a CentOS 7 environment. I set the ulimit to 1 million and the system limit to 10 million open files. Now the 3 Kafka nodes are running on separate servers and the streaming application is running in a dedicated VM. The application is no longer explicitly throwing the “too many open files” error, but it is automatically aborted with the message:
> 
> terminate called after throwing an instance of 'std::system_error'
>   what(): Resource temporarily unavailable
> ./bin/start.sh: line 42: 1100 Aborted
> 
> Here are the last few lines from the strace output, which show the abort:
> 
> 25721 14:58:35 open("/home/devops/data/kafka-streams/RawLog_Processor/393_7/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN.1555200000/MANIFEST-000007", O_RDONLY|O_CLOEXEC) = 12505
> 25721 14:58:35 open("/sys/devices/virtual/block/dm-2/queue/logical_block_size", O_RDONLY) = 12506
> 25721 14:58:35 read(12506, "512\n", 4096) = 4
> 25721 14:58:35 close(12506) = 0
> 25721 14:58:35 write(12502, "s.advise_random_on_open: 0\n2019/"..., 4096) = 4096
> 25721 14:58:35 write(12502, "ions.comparator: leveldb.Bytewis"..., 4096) = 4096
> 25721 14:58:35 read(12505, "V\371\270\370\34\0\1\1\32leveldb.BytewiseCompara"..., 32768) = 192
> 25721 14:58:35 read(12505, "", 28672) = 0
> 25721 14:58:35 close(12505) = 0
> 17701 14:58:35 open("/home/devops/data/kafka-streams/RawLog_Processor/393_7/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN.1555200000/000006.sst", O_RDONLY|O_CLOEXEC) = 12505
> 17702 14:58:35 +++ exited with 0 +++
> 25721 14:58:35 write(2, "terminate called after throwing "..., 48) = 48
> 25721 14:58:35 write(2, "std::system_error", 17) = 17
> 25721 14:58:35 write(2, "'\n", 2) = 2
> 25721 14:58:35 write(2, " what(): ", 11) = 11
> 25721 14:58:35 write(2, "Resource temporarily unavailable", 32) = 32
> 25721 14:58:35 write(2, "\n", 1) = 1
> 17701 14:58:35 open("/sys/devices/virtual/block/dm-2/queue/logical_block_size", O_RDONLY) = 12506
> 25721 14:58:35 write(12502, "ction_dynamic_level_bytes: 0\n201"..., 3672) = 3672
> 25721 14:58:35 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=25697, si_uid=1000} ---
> 17701 14:58:35 read(12506, <unfinished ...>
> 17701 14:58:36 +++ killed by SIGABRT +++
> 17700 14:58:36 +++ killed by SIGABRT +++
> 17699 14:58:36 +++ killed by SIGABRT +++
> 
> As I can see, the number of open files (45201) is way lower than the max limit, so can we rule out “open files” as the root cause?
> 
> I have also noticed there are a lot of EAGAIN errors (count 3334) and associated “Resource temporarily unavailable” messages as well:
> 
> 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25732 14:49:23 close(59) = 0
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25721 14:49:23 write(50, "\0", 1) = 1
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25721 14:49:23 read(17, "PK\3\4\n\0\0\10\10\0\343SHNM\n4>Q\2\0\0\265\5\0\0A\0\0\0", 30) = 30
> 25721 14:49:23 read(17, "\245T]O\23A\24=\323\226\326-\253\255\250\365\v\4\265ji\221E\344\305\200M\f\201\304\244j"..., 593) = 593
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25732 14:49:23 close(59) = 0
> 25721 14:49:23 read(49, "\0", 16) = 1
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25732 14:49:23 close(59) = 0
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25721 14:49:23 read(54, "\0\1dx", 4) = 4
> 25721 14:49:23 read(54, "\0\0\0\v\0\0\0\0\0\0\0\1dj\0\0\0\0\1\212\0\201RawLog_Pro"..., 91256) = 91256
> 25721 14:49:23 read(54, 0x7f7044003b80, 4) = -1 EAGAIN (Resource temporarily unavailable)
> 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25721 14:49:23 write(35, "2019-07-03 14:49:23.258 [rawLP1-"..., 219 <unfinished ...>
> 25732 14:49:23 close(59) = 0
> 25721 14:49:23 <... write resumed> ) = 219
> 25721 14:49:23 write(1, "2019-07-03 14:49:23.258 [rawLP1-"..., 219) = 219
> 25721 14:49:23 write(50, "\0", 1) = 1
> 25721 14:49:23 read(49, "\0", 16) = 1
> 25708 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 60
> 25708 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25708 14:49:23 close(59) = 0
> 25709 14:49:23 read(60, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(60) = 0
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25708 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25708 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25708 14:49:23 close(59) = 0
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25709 14:49:23 close(59) = 0
> 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 25732 14:49:23 close(59) = 0
> 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> 
> But I am unable to come to any conclusion from these findings. Any insight is appreciated.
> 
> Thanks
> Thameem