Hi,

I ran into the issue again with Tika/Java using more CPU, up to 200+ %CPU. The scenario is that I have 3-4 long-running processes calling the Tika server (version 1.24), and occasionally 3-4 additional shorter processes (2-3 hours) start up and call the Tika server as well. The scenario runs for a couple of days, extracting text from various types of documents.
The Tika server is running locally. Top shows this:

----------------------------------------------------------------------------
top - 16:21:17 up 5 days,  8:12,  6 users,  load average: 2,64, 2,63, 2,61
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50,8 us,  0,3 sy,  0,0 ni, 48,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  4032128 total,   129052 free,  2702236 used,  1200840 buff/cache
KiB Swap:  4192252 total,  2968864 free,  1223388 used.  1040340 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  911 root      20   0 4578604 1,229g   8024 S 204,3 32,0 859:11.22 java
  743 root      20   0  196596   5772    920 S   0,7  0,1  35:28.02 wizit_rest
34637 elastic+  20   0 21,346g 883808  30616 S   0,3 21,9   1250:04 java
    1 root      20   0  204620   3440   2376 S   0,0  0,1   0:14.99 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.15 kthreadd
    3 root      20   0       0      0      0 S   0,0  0,0   1:46.20 ksoftirqd+
    5 root       0 -20       0      0      0 S   0,0  0,0   0:00.00 kworker/0+
    7 root      20   0       0      0      0 S   0,0  0,0   4:59.14 rcu_sched
    8 root      20   0       0      0      0 S   0,0  0,0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0,0  0,0   0:03.83 migration+
----------------------------------------------------------------------------

At first I ran jstackseries.sh:

----------------------------------------------------------------------------
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:

Can't print deadlocks:Unable to deduce type of thread from address
0x00007f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread,
JvmtiAgentThread, or SurrogateLockerThread)
----------------------------------------------------------------------------

It also froze the system ("systemd[1]: Freezing execution."), but I finally got a thread dump via jstack, which I attach. I am also attaching the tika-config file in case it is useful.
Hope this helps to analyze the issue.

Kind regards
Hans

-----Original Message-----
From: Nick Burch <apa...@gagravarr.org>
Sent: 16 April 2020 15:40
To: hans.mei...@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: Issue with > 200% CPU after bulk usage

On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:
> I have encountered an issue with Tika running locally on a box that
> the Java runtime goes up to over 200% CPU, after running a bulk load
> of documents over a couple of days, it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178

Nick
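For anyone reading the attached thread dump: one common way to find the busy thread is to list per-thread CPU with `top -H -p 911`, take the decimal thread ID of the hottest entry, and look for the same ID in hex as `nid=0x...` in the jstack output. The sketch below is only an illustration of that lookup. It assumes the regular jstack text format (thread entries separated by blank lines, header lines carrying `nid=0x...`), and the helper name `find_thread_by_tid` and the sample dump are made up for this example, not taken from the attachment.

```python
import re


def find_thread_by_tid(dump_text, tid):
    """Return the jstack block whose nid matches the decimal OS thread ID, or None."""
    # jstack prints the native thread ID as lowercase hex, e.g. nid=0x3a5
    nid = hex(tid)
    pattern = re.compile(r"\bnid=" + re.escape(nid) + r"\b")
    # Assumption: thread entries in the dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text.strip()):
        if pattern.search(block):
            return block
    return None


# Made-up dump fragment for illustration only; not from the attached file.
SAMPLE_DUMP = '''"qtp-worker-1" #21 prio=5 os_prio=0 tid=0x00007f0000000001 nid=0x3a5 runnable [0x00007f0000001000]
   java.lang.Thread.State: RUNNABLE
        at example.Parser.parse(Parser.java:10)

"qtp-worker-2" #22 prio=5 os_prio=0 tid=0x00007f0000000002 nid=0x3a6 waiting on condition [0x00007f0000002000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
'''

# 933 is a hypothetical thread ID as it would appear in the PID column of top -H.
hot = find_thread_by_tid(SAMPLE_DUMP, 933)
print(hot.splitlines()[0] if hot else "no matching thread")
```

Repeating the lookup against a few dumps taken some seconds apart shows whether the same stack stays hot, which usually points at the parser or document type involved.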
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to you under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!-- NOTE: tika-batch is still an experimental feature.
     The configuration file will likely change and be backward incompatible
     with new versions of Tika. Please stay tuned.
-->
<tika-batch-config
    maxAliveTimeSeconds="-1"
    pauseOnEarlyTerminationMillis="10000"
    timeoutThresholdMillis="300000"
    timeoutCheckPulseMillis="1000"
    maxQueueSize="10000"
    numConsumers="default">
  <!-- numConsumers = number of file consumers, "default" = number of processors -1 -->

  <!-- options to allow on the commandline -->
  <commandline>
    <option opt="c" longOpt="tika-config" hasArg="true"
            description="TikaConfig file"/>
    <option opt="bc" longOpt="batch-config" hasArg="true"
            description="xml batch config file"/>
    <!-- We needed sorted for testing. We added random for performance.
         Where crawling a directory is slow, it might be beneficial to
         go randomly so that the parsers are triggered earlier. The
         default is operating system's choice ("os") which means whatever order
         the os returns files in .listFiles().
    -->
    <option opt="crawlOrder" hasArg="true"
            description="how does the crawler sort the directories and files: (random|sorted|os)"/>
    <option opt="numConsumers" hasArg="true"
            description="number of fileConsumers threads"/>
    <option opt="maxFileSizeBytes" hasArg="true"
            description="maximum file size to process; do not process files larger than this"/>
    <option opt="maxQueueSize" hasArg="true"
            description="maximum queue size for FileResources"/>
    <option opt="fileList" hasArg="true"
            description="file that contains a list of files (relative to inputDir) to process"/>
    <option opt="fileListEncoding" hasArg="true"
            description="encoding for fileList"/>
    <option opt="inputDir" hasArg="true"
            description="root directory for the files to be processed"/>
    <option opt="startDir" hasArg="true"
            description="directory (under inputDir) at which to start crawling"/>
    <option opt="outputDir" hasArg="true"
            description="output directory for output"/> <!-- do we want to make this mandatory -->
    <option opt="recursiveParserWrapper"
            description="use the RecursiveParserWrapper or not (default = false)"/>
    <option opt="streamOut"
            description="stream the output of the RecursiveParserWrapper (default = false)"/>
    <option opt="handleExisting" hasArg="true"
            description="if an output file already exists, do you want to: overwrite, rename or skip"/>
    <option opt="basicHandlerType" hasArg="true"
            description="what type of content handler: xml, text, html, body"/>
    <option opt="outputSuffix" hasArg="true"
            description="suffix to add to the end of the output file name"/>
    <option opt="timeoutThresholdMillis" hasArg="true"
            description="how long to wait before determining that a consumer is stale"/>
    <option opt="includeFilePat" hasArg="true"
            description="regex that specifies which files to process"/>
    <option opt="excludeFilePat" hasArg="true"
            description="regex that specifies which files to avoid processing"/>
    <option opt="reporterSleepMillis" hasArg="true"
            description="millisecond between reports by the reporter"/>
    <option opt="digest" hasArg="true"
            description="which digest(s) to use, e.g. 'md5,sha512'"/>
    <option opt="digestMarkLimit" hasArg="true"
            description="max bytes to read for digest"/>
  </commandline>

  <!-- can specify inputDir="input", but the default config should not include this -->
  <!-- can also specify startDir="input/someDir" to specify which child directory
       to start processing -->
  <crawler builderClass="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
           crawlOrder="random"
           maxFilesToAdd="-1"
           maxFilesToConsider="-1"
           includeFilePat=""
           excludeFilePat=""
           maxFileSizeBytes="-1"
  />
  <!--
    This is an example of a crawler that reads a list of files to be processed from a
    file. This assumes that the files in the list are relative to inputDir.
    <crawler class="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
             fileList="files.txt"
             fileListEncoding="UTF-8"
             maxFilesToAdd="-1"
             maxFilesToConsider="-1"
             includeFilePat="(?i).pdf$"
             excludeFilePat="(?i).msg$"
             maxFileSizeBytes="-1"
             inputDir="input"
    />
  -->
  <!--
    To wrap parser in RecursiveParserWrapper (tika-app's -J or tika-server's /rmeta),
    add attribute recursiveParserWrapper="true" to consumers element.

    To wrap parser with DigestingParser add attributes e.g.:
    digest="md5,sha256" digestMarkLimit="10000000"
  -->
  <consumers builderClass="org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder"
             recursiveParserWrapper="false"
             consumersManagerMaxMillis="60000">
    <parser builderClass="org.apache.tika.batch.builders.AppParserFactoryBuilder"
            class="org.apache.tika.batch.DigestingAutoDetectParserFactory"
            parseRecursively="true"
            digest="md5"
            digestMarkLimit="1000000"/>
    <contenthandler builderClass="org.apache.tika.batch.builders.DefaultContentHandlerFactoryBuilder"
                    basicHandlerType="xml"
                    writeLimit="-1"/>
    <!-- can specify custom output file suffix with:
         suffix=".mysuffix"
         if no suffix is specified, BasicTikaFSConsumersBuilder does its best to guess -->
    <!-- can specify compression with
         compression="bzip2|gzip|zip" -->
    <outputstream class="FSOutputStreamFactory" encoding="UTF-8"/>
  </consumers>

  <!-- reporter and interrupter are optional -->
  <reporter builderClass="org.apache.tika.batch.builders.SimpleLogReporterBuilder"
            reporterSleepMillis="1000"
            reporterStaleThresholdMillis="60000"/>
  <interrupter builderClass="org.apache.tika.batch.builders.InterrupterBuilder"/>
</tika-batch-config>