Hi Eric,
Follow-up observations and comments in-line.
What time does this problem start?
It typically starts around 9:15 am EDT and goes until around 10:30 am.
Does that match the time of highest request load from the customer?
Rather than a spike, I'm wondering if the problem is triggered once load
exceeds some threshold.
We finished and implemented the script yesterday, so today will be the first
day that it produces results. It watches the catalina.out file for stuck thread
detection warnings. When the number of stuck threads exceeds a threshold, it
starts doing thread dumps every 60 seconds until the count drops back down
below the threshold. The users typically do not complain of slowness until the
stuck thread count exceeds 20, and during that time the threads often take a
minute or more to complete. It's too late today to change the timings, but
if it does not produce any actionable intel, we can adjust them tonight.
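For reference, a minimal sketch of that kind of monitor, assuming the warnings
come from Tomcat's StuckThreadDetectionValve and using made-up values for the
log path, PID and thresholds (the real script may well differ in all of these):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StuckThreadMonitor {

    private static final String CATALINA_OUT = "/opt/tomcat/logs/catalina.out"; // assumed path
    private static final String STUCK_MARKER = "StuckThreadDetectionValve";     // warning text to match
    private static final int DUMP_THRESHOLD = 20;        // start dumping above this many stuck threads
    private static final long DUMP_INTERVAL_MS = 60_000L; // one dump per minute while above threshold
    private static final String TOMCAT_PID = "12345";     // assumed; discovered at runtime in practice

    public static void main(String[] args) throws IOException, InterruptedException {
        try (BufferedReader log = new BufferedReader(new FileReader(CATALINA_OUT))) {
            log.skip(Long.MAX_VALUE); // start tailing from the current end of the file
            int stuck = 0;
            long lastDump = 0;
            while (true) {
                String line = log.readLine();
                if (line == null) {      // nothing new yet; poll again shortly
                    Thread.sleep(1_000);
                    continue;
                }
                if (line.contains(STUCK_MARKER)) {
                    stuck = parseStuckCount(line, stuck);
                }
                long now = System.currentTimeMillis();
                if (stuck > DUMP_THRESHOLD && now - lastDump >= DUMP_INTERVAL_MS) {
                    // jstack <pid> >> thread-dumps.txt  -- capture a full thread dump
                    new ProcessBuilder("jstack", TOMCAT_PID)
                            .redirectOutput(ProcessBuilder.Redirect.appendTo(
                                    new java.io.File("thread-dumps.txt")))
                            .start()
                            .waitFor();
                    lastDump = now;
                }
            }
        }
    }

    private static int parseStuckCount(String line, int current) {
        // Placeholder: the real script would parse the count the valve reports in its
        // "detected"/"completed" messages rather than tallying lines like this.
        return line.contains("completed") ? Math.max(0, current - 1) : current + 1;
    }
}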
Let's see what that produces and go from there.
The vendor claims that the feature uses a different server and does not send
requests to the slow ones, so it has been re-enabled at the customer's request.
We may ask them to disable it again until we get this issue resolved.
Noted.
This customer sends about 1.5 million requests to each load-balanced server
during a typical production day. Most other customers send much less, often
only a fraction of that. However, at least one customer sends about 2 million
to the same server, and they don't see the problem. (I will check if they have
the AI feature enabled.)
Hmm. Whether that other customer has the AI feature enabled would be an
interesting data point.
Can we see the full stack trace, please?
Here's one example.
<snip/>
java.lang.Throwable
        at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1252)
        at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1220)
<snip/>
That is *very* interesting. That is the start of a synch block in the
class loader. It should complete quickly. The full thread dump should
tell us what is holding the lock. If we are lucky we'll be able to tell
why the lock is being held for so long.
We might need to reduce the time between thread dumps to figure out what
the blocking thread is actually doing. We'll see.
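For anyone following along, a simplified stand-in for the locking pattern
behind that frame (this is not Tomcat's actual code, just an illustration of
why one slow lookup can queue every other class-loading thread behind the
same monitor):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LockingClassLoaderSketch extends ClassLoader {

    // One lock object per class name, as in a parallel-capable class loader.
    private final ConcurrentMap<String, Object> classLoadingLocks = new ConcurrentHashMap<>();

    @Override
    protected Object getClassLoadingLock(String className) {
        return classLoadingLocks.computeIfAbsent(className, k -> new Object());
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {   // roughly the frame seen at line 1252
            Class<?> loaded = findLoadedClass(name); // already-loaded cache
            if (loaded != null) {
                return loaded;
            }
            // If this delegation is slow (I/O, contention elsewhere), every other
            // thread waiting on the same lock shows up blocked at the synchronized entry.
            return super.loadClass(name);
        }
    }
}

Under normal conditions the cache hit returns almost immediately, which is why
that frame should not be where threads spend a minute or more.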
The app has DB connection details in two places. First, it uses a database
connection string in a .properties file, as follows. This string handles most
connections to the DB.
mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueriesOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=true&dontTrackOpenResources=true
mobiledoc.DBUser=<redacted>
mobiledoc.DBPassword=<redacted>
OK. That seems unlikely to be using connection pooling, although the
application might be pooling internally.
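For illustration, this is roughly what an unpooled path built from those
properties would look like. Only the property keys come from the snippet
above; the file name and the way the URL is assembled are assumptions:

import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class UnpooledConnectionSketch {

    public static Connection open() throws IOException, SQLException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("mobiledoc.properties")) { // assumed file name
            props.load(in);
        }
        // e.g. jdbc:mysql://ha52a:5791/mobiledoc_791?useSSL=false&...
        String url = props.getProperty("mobiledoc.DBUrl") + "/" + props.getProperty("mobiledoc.DBName");
        // DriverManager opens a new physical connection every time; no pooling here.
        return DriverManager.getConnection(url,
                props.getProperty("mobiledoc.DBUser"),
                props.getProperty("mobiledoc.DBPassword"));
    }
}

If every request does something like this, connection setup cost and DB-side
limits become a factor, which is why it matters whether the app pools internally.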
It also has a second DB config, specifically for a drug database.
<MediSpan>
<!-- Log settings -->
<Logging enabled="false">
<SQLQuery enabled="true" />
<ConstructorCall enabled="false" />
<FunctionCall enabled="false" />
<CustomLogFileLocation>c:\out.log</CustomLogFileLocation>
</Logging>
<!-- Medispan Content definitions -->
<DataSource id="ecw" DataBase="medispan" />
<!-- DataSources -->
<DataBase id="medispan">
<JNDIEnabled enabled="false" />
<INITIAL_CONTEXT_FACTORY>INSERT_CONTEXT_FACTORY</INITIAL_CONTEXT_FACTORY>
<JNDI_URL>INSERT_JNDI_URL</JNDI_URL>
<JNDI_USER_NAME>INSERT_USER_NAME</JNDI_USER_NAME>
<JNDI_PASSWORD>INSERT_PASSWORD</JNDI_PASSWORD>
<LOOKUP_NAME>INSERT_LOOKUP_NAME</LOOKUP_NAME>
<DRIVER>com.mysql.jdbc.Driver</DRIVER>
<URL>jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=28800,interactive_timeout=28800</URL>
<USER_NAME>redacted</USER_NAME>
<PASSWORD>redacted</PASSWORD>
<POOLSIZE>10</POOLSIZE>
<TIMEOUT>5000</TIMEOUT>
</DataBase>
<XMLContent id="DIB310" dir="INSERT_CONTENT_DIR" />
<Images id="DIB310" dir="INSERT_IMAGE_DIR" />
<Cache>
<Enabled>true</Enabled>
<SecondsTillStale>0</SecondsTillStale>
<LastUpdateInterval>1800</LastUpdateInterval>
</Cache>
</MediSpan>
Hmm. There is a pool size setting there but we can't tell if it is being
used.
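For reference, if that <POOLSIZE>/<TIMEOUT> pair really is wired into a pool,
the effect would be something like the sketch below. Apache Commons DBCP2 is
used purely for illustration; we do not know which pooling library, if any,
the MediSpan layer actually uses:

import org.apache.commons.dbcp2.BasicDataSource;

public class MedispanPoolSketch {

    public static BasicDataSource build() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://dbclust54:5791/medispan"
                + "?sessionVariables=wait_timeout=28800,interactive_timeout=28800");
        ds.setUsername("redacted");
        ds.setPassword("redacted");
        ds.setMaxTotal(10);        // POOLSIZE: at most 10 physical connections
        ds.setMaxWaitMillis(5000); // TIMEOUT: callers wait up to 5 s for a free connection
        return ds;
    }
}

A cap of 10 connections with a 5 second wait would itself cause queueing under
heavy load, so confirming whether these values are honoured is worthwhile.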
Is that Tomcat 9.0.80 as provided by the ASF?
An explicit answer to this question would be helpful.
In terms of the way forward, we need to see the thread dumps taken while the
problem is happening to figure out where the blockage is and (hopefully) why.
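For completeness, the specific piece of information we want out of those dumps
is which thread holds the monitor that the stuck threads are blocked on. A
jstack or kill -3 dump contains this; the sketch below just shows the
programmatic equivalent via the JMX thread API, in case it is easier to wire
into the existing script (illustrative only, not part of the current tooling):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockOwnerReport {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s is BLOCKED on %s held by %s%n",
                        info.getThreadName(),
                        info.getLockName(),        // e.g. the class loading lock object
                        info.getLockOwnerName());  // the thread we actually need to look at
            }
        }
    }
}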
Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org