Hi
I ran into the issue again where the Tika/Java process consumes excessive CPU, upwards of 200%.
 
The scenario is that I have 3-4 long-running processes calling the Tika server
(version 1.24), and occasionally 3-4 additional, shorter processes (2-3 hours
each) start up and also call the Tika server.
This scenario has been running for a couple of days, extracting text from
various types of documents.

The Tika server is running locally.
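
For reference, each of these processes calls the server over HTTP, roughly as
in the minimal sketch below (the /tika endpoint and port 9998 are the
tika-server defaults; the class name and file-path argument are just for
illustration):

------------------------------------------------------------------------------
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TikaClientSketch {
    public static void main(String[] args) throws Exception {
        // PUT the document body to tika-server's /tika endpoint and ask for
        // the extracted content back as plain text
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:9998/tika").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(Paths.get(args[0]), out);  // args[0] = document path
        }
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);      // extracted plain text
            }
        }
    }
}
------------------------------------------------------------------------------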

 
Top shows this:

------------------------------------------------------------------------------
top - 16:21:17 up 5 days,  8:12,  6 users,  load average: 2,64, 2,63, 2,61
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50,8 us,  0,3 sy,  0,0 ni, 48,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  4032128 total,   129052 free,  2702236 used,  1200840 buff/cache
KiB Swap:  4192252 total,  2968864 free,  1223388 used.  1040340 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   911 root      20   0 4578604 1,229g   8024 S 204,3 32,0 859:11.22 java
   743 root      20   0  196596   5772    920 S   0,7  0,1  35:28.02 wizit_rest
 34637 elastic+  20   0 21,346g 883808  30616 S   0,3 21,9   1250:04 java
     1 root      20   0  204620   3440   2376 S   0,0  0,1   0:14.99 systemd
     2 root      20   0       0      0      0 S   0,0  0,0   0:00.15 kthreadd
     3 root      20   0       0      0      0 S   0,0  0,0   1:46.20 ksoftirqd+
     5 root       0 -20       0      0      0 S   0,0  0,0   0:00.00 kworker/0+
     7 root      20   0       0      0      0 S   0,0  0,0   4:59.14 rcu_sched
     8 root      20   0       0      0      0 S   0,0  0,0   0:00.00 rcu_bh
     9 root      rt   0       0      0      0 S   0,0  0,0   0:03.83 migration+
------------------------------------------------------------------------------


At first I ran jstackseries.sh:
------------------------------------------------------------------------------
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:

Can't print deadlocks:Unable to deduce type of thread from address
0x00007f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread,
JvmtiAgentThread, or SurrogateLockerThread)
------------------------------------------------------------------------------

It also froze the system ("systemd[1]: Freezing execution.").
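
Since this kind of attach suspends the process, a possible workaround (my own
sketch, not a Tika facility) would be to pull a thread dump over JMX instead,
assuming the server JVM is started with remote JMX enabled; the port 9010
below is hypothetical:

------------------------------------------------------------------------------
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteThreadDump {
    public static void main(String[] args) throws Exception {
        // Assumes the server JVM was started with e.g.
        //   -Dcom.sun.management.jmxremote.port=9010
        //   -Dcom.sun.management.jmxremote.authenticate=false
        //   -Dcom.sun.management.jmxremote.ssl=false
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME,
                    ThreadMXBean.class);
            // true, true = include locked monitors and synchronizers,
            // similar to what jstack -l reports
            for (ThreadInfo ti : threads.dumpAllThreads(true, true)) {
                System.out.print(ti);
            }
        }
    }
}
------------------------------------------------------------------------------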


But I finally got a thread dump via jstack; I have attached that file. I have
also attached the tika-config file in case it could be useful.
I hope this helps in analyzing the issue.


Kind regards 
Hans


-----Original Message-----
From: Nick Burch <apa...@gagravarr.org>
Sent: 16 April 2020 15:40
To: hans.mei...@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: Issue with > 200% CPU after bulk usage

On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:
> I have encountered an issue with Tika running locally on a box where
> the Java runtime goes up to over 200% CPU after running a bulk load
> of documents over a couple of days; it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178
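
For example, something like: jstack -l <pid> > threaddump.txt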

Nick

[Attachment: the tika-config file referenced above]

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->
<!-- NOTE: tika-batch is still an experimental feature.
    The configuration file will likely change and be backward incompatible
    with new versions of Tika.  Please stay tuned.
    -->

<tika-batch-config
        maxAliveTimeSeconds="-1"
        pauseOnEarlyTerminationMillis="10000"
        timeoutThresholdMillis="300000"
        timeoutCheckPulseMillis="1000"
        maxQueueSize="10000"
        numConsumers="default"> <!-- numConsumers = number of file consumers, "default" = number of processors -1 -->

    <!-- options to allow on the commandline -->
    <commandline>
        <option opt="c" longOpt="tika-config" hasArg="true"
                description="TikaConfig file"/>
        <option opt="bc" longOpt="batch-config" hasArg="true"
                description="xml batch config file"/>
        <!-- We needed sorted for testing.  We added random for performance.
             Where crawling a directory is slow, it might be beneficial to
             go randomly so that the parsers are triggered earlier.  The
             default is operating system's choice ("os") which means whatever order
             the os returns files in .listFiles(). -->
        <option opt="crawlOrder" hasArg="true"
                description="how does the crawler sort the directories and files:
                                (random|sorted|os)"/>
        <option opt="numConsumers" hasArg="true"
                description="number of fileConsumers threads"/>
        <option opt="maxFileSizeBytes" hasArg="true"
                description="maximum file size to process; do not process files larger than this"/>
        <option opt="maxQueueSize" hasArg="true"
                description="maximum queue size for FileResources"/>
        <option opt="fileList" hasArg="true"
                description="file that contains a list of files (relative to inputDir) to process"/>
        <option opt="fileListEncoding" hasArg="true"
                description="encoding for fileList"/>
        <option opt="inputDir" hasArg="true"
                description="root directory for the files to be processed"/>
        <option opt="startDir" hasArg="true"
                description="directory (under inputDir) at which to start crawling"/>
        <option opt="outputDir" hasArg="true"
                description="output directory for output"/> <!-- do we want to make this mandatory -->
        <option opt="recursiveParserWrapper"
                description="use the RecursiveParserWrapper or not (default = false)"/>
        <option opt="streamOut" description="stream the output of the RecursiveParserWrapper (default = false)"/>
        <option opt="handleExisting" hasArg="true"
                description="if an output file already exists, do you want to: overwrite, rename or skip"/>
        <option opt="basicHandlerType" hasArg="true"
                description="what type of content handler: xml, text, html, body"/>
        <option opt="outputSuffix" hasArg="true"
                description="suffix to add to the end of the output file name"/>
        <option opt="timeoutThresholdMillis" hasArg="true"
                description="how long to wait before determining that a consumer is stale"/>
        <option opt="includeFilePat" hasArg="true"
                description="regex that specifies which files to process"/>
        <option opt="excludeFilePat" hasArg="true"
                description="regex that specifies which files to avoid processing"/>
        <option opt="reporterSleepMillis" hasArg="true"
                description="millisecond between reports by the reporter"/>
        <option opt="digest" hasArg="true"
                description="which digest(s) to use, e.g. 'md5,sha512'\"/>
        <option opt="digestMarkLimit" hasArg="true"
                description="max bytes to read for digest\"/>
    </commandline>


    <!-- can specify inputDir="input", but the default config should not include this -->
    <!-- can also specify startDir="input/someDir" to specify which child directory
         to start processing -->
    <crawler builderClass="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
             crawlOrder="random"
             maxFilesToAdd="-1"
             maxFilesToConsider="-1"
             includeFilePat=""
             excludeFilePat=""
             maxFileSizeBytes="-1"
    />
<!--
    This is an example of a crawler that reads a list of files to be processed from a
    file.  This assumes that the files in the list are relative to inputDir.
    <crawler class="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
             fileList="files.txt"
             fileListEncoding="UTF-8"
             maxFilesToAdd="-1"
             maxFilesToConsider="-1"
             includeFilePat="(?i).pdf$"
             excludeFilePat="(?i).msg$"
             maxFileSizeBytes="-1"
             inputDir="input"
    />
-->
    <!--
        To wrap parser in RecursiveParserWrapper (tika-app's -J or tika-server's /rmeta),
        add attribute recursiveParserWrapper="true" to consumers element.

        To wrap parser with DigestingParser add attributes e.g.:
        digest="md5,sha256" digestMarkLimit="10000000"
        -->
    <consumers builderClass="org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder"
               recursiveParserWrapper="false" consumersManagerMaxMillis="60000">
        <parser builderClass="org.apache.tika.batch.builders.AppParserFactoryBuilder"
                class="org.apache.tika.batch.DigestingAutoDetectParserFactory"
                parseRecursively="true"
                digest="md5" digestMarkLimit="1000000"/>
        <contenthandler builderClass="org.apache.tika.batch.builders.DefaultContentHandlerFactoryBuilder"
                        basicHandlerType="xml" writeLimit="-1"/>
        <!-- can specify custom output file suffix with:
            suffix=".mysuffix"
            if no suffix is specified, BasicTikaFSConsumersBuilder does its best to guess -->
        <!-- can specify compression with
            compression="bzip2|gzip|zip" -->

        <outputstream class="FSOutputStreamFactory" encoding="UTF-8"/>
    </consumers>

    <!-- reporter and interrupter are optional -->
    <reporter builderClass="org.apache.tika.batch.builders.SimpleLogReporterBuilder" reporterSleepMillis="1000"
              reporterStaleThresholdMillis="60000"/>
    <interrupter builderClass="org.apache.tika.batch.builders.InterrupterBuilder"/>
</tika-batch-config>
