Hi Eric,

Follow-up observations and comments in-line.

What time does this problem start?

It typically starts around 9:15 am EDT and goes until around 10:30 am.

Does that match the time of highest request load from the customer? Rather than a spike, I'm wondering if the problem is triggered once load exceeds some threshold.

We finished and implemented the script yesterday, so today will be the first 
day that it produces results. It watches the catalina.out file for stuck thread 
detection warnings. When the number of stuck threads exceeds a threshold, then 
it starts doing thread dumps every 60 seconds until the count drops back down 
below the threshold. The users typically do not complain of slowness until the 
stuck thread count exceeds 20, and during that time the threads often take up 
to a minute or more to complete. It's too late today to change the timings, but 
if it does not produce any actionable intel, we can adjust them tonight.
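
Roughly, the logic is equivalent to the following sketch (written here in Java for illustration rather than pasting the actual script; the match strings, threshold, and jstack handling are placeholders, and the real stuck-thread log messages may differ):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

// Tails catalina.out, keeps a rough count of stuck-thread warnings, and
// runs jstack once per DUMP_INTERVAL_MS while the count stays above THRESHOLD.
// For simplicity this reads from the start of the file; the real script
// presumably starts from the current end of the log.
public class StuckThreadWatcher {

    private static final int THRESHOLD = 10;             // placeholder threshold
    private static final long DUMP_INTERVAL_MS = 60_000; // one dump per minute

    public static void main(String[] args) throws Exception {
        String logFile = args[0];  // e.g. .../logs/catalina.out
        String pid = args[1];      // Tomcat PID passed to jstack

        int stuck = 0;
        long lastDump = 0;

        try (BufferedReader reader = new BufferedReader(new FileReader(logFile))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {        // no new output yet; poll again
                    Thread.sleep(1000);
                    continue;
                }
                // Approximate match strings for the stuck thread detection warnings.
                if (line.contains("may be stuck")) {
                    stuck++;
                } else if (line.contains("was previously reported to be stuck")) {
                    stuck = Math.max(0, stuck - 1);
                }
                long now = System.currentTimeMillis();
                if (stuck > THRESHOLD && now - lastDump >= DUMP_INTERVAL_MS) {
                    new ProcessBuilder("jstack", "-l", pid)
                            .redirectOutput(new File("threaddump-" + now + ".txt"))
                            .start();
                    lastDump = now;
                }
            }
        }
    }
}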

Let's see what that produces and go from there.

The vendor claims that the feature uses a different server and does not send 
requests to the slow ones, so it has been re-enabled at the customer's request. 
We may ask them to disable it again until we get this issue resolved.

Noted.

This customer sends about 1.5 million requests to each load-balanced server 
during a typical production day. Most other customers send far fewer, often 
only a fraction of that. However, at least one customer sends about 2 million 
to the same server, and they don't see the problem. (I will check if they have 
the AI feature enabled.)

Hmm. Whether that other customer has the AI feature enabled would be an interesting data point.

Can we see the full stack trace, please?

Here's one example.

<snip/>

         java.lang.Throwable
                 at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1252)
                 at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1220)

<snip/>

That is *very* interesting. That is the start of a synchronized block in the class loader. It should complete quickly. The full thread dump should tell us what is holding the lock. If we are lucky, we'll be able to tell why the lock is being held for so long.
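
To make the point concrete, the standard JDK class-loading pattern looks roughly like this (a simplified sketch of java.lang.ClassLoader.loadClass(), not the actual WebappClassLoaderBase code, which adds its own delegation order, caching and resource lookup around the same idea):

public class SketchLoader extends ClassLoader {
    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        // One lock object per class name; every thread asking for the same
        // class queues here until the first one finishes loading it.
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);        // already defined by this loader?
            if (c == null && getParent() != null) {
                try {
                    c = getParent().loadClass(name);   // delegate to the parent first
                } catch (ClassNotFoundException e) {
                    // not visible to the parent; fall back to our own repositories
                }
            }
            if (c == null) {
                c = findClass(name);                   // may hit disk / scan JARs
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}

If whatever holds that lock is slow (for example, findClass() blocked on I/O), every other thread that needs the same lock queues up behind it, which would look exactly like a sudden burst of stuck threads.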

We might need to reduce the time between thread dumps to figure out what the thread that is blocking everything is doing. We'll see.

The app has DB connection details in two places. First, it uses a database 
connection string in a .properties file, as follows. This string handles most 
connections to the DB.

mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueriesOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=true&dontTrackOpenResources=true
mobiledoc.DBUser=<redacted>
mobiledoc.DBPassword=<redacted>

OK. That seems unlikely to be using connection pooling, although the application might be pooling internally.
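
For comparison, pooled access with Tomcat's own JDBC pool would look something like the sketch below (values are placeholders taken from the properties above; whether the application does anything like this internally is exactly what we can't tell from the .properties file):

import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class PooledDataSourceSketch {
    public static DataSource build() {
        PoolProperties p = new PoolProperties();
        // URL/credentials copied from the mobiledoc.* properties, purely illustrative
        p.setUrl("jdbc:mysql://ha52a:5791/mobiledoc_791?useSSL=false");
        p.setDriverClassName("com.mysql.jdbc.Driver");
        p.setUsername("redacted");
        p.setPassword("redacted");
        p.setMaxActive(50);    // cap on concurrent connections
        p.setMaxWait(10000);   // ms to wait for a free connection before failing
        return new DataSource(p);
    }
}

Without something along those lines, each request is likely opening and closing its own connection, which could magnify any slowdown on the database side.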

It also has a second DB config specifically for a drug database.

<MediSpan>
   <!-- Log settings -->
   <Logging enabled="false">
     <SQLQuery enabled="true" />
     <ConstructorCall enabled="false" />
     <FunctionCall enabled="false" />
     <CustomLogFileLocation>c:\out.log</CustomLogFileLocation>
   </Logging>
   <!-- Medispan Content definitions -->
   <DataSource id="ecw" DataBase="medispan" />
   <!-- DataSources -->
   <DataBase id="medispan">
     <JNDIEnabled enabled="false" />
     <INITIAL_CONTEXT_FACTORY>INSERT_CONTEXT_FACTORY</INITIAL_CONTEXT_FACTORY>
     <JNDI_URL>INSERT_JNDI_URL</JNDI_URL>
     <JNDI_USER_NAME>INSERT_USER_NAME</JNDI_USER_NAME>
     <JNDI_PASSWORD>INSERT_PASSWORD</JNDI_PASSWORD>
     <LOOKUP_NAME>INSERT_LOOKUP_NAME</LOOKUP_NAME>
     <DRIVER>com.mysql.jdbc.Driver</DRIVER>
     <URL>jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=28800,interactive_timeout=28800</URL>
     <USER_NAME>redacted</USER_NAME>
     <PASSWORD>redacted</PASSWORD>
     <POOLSIZE>10</POOLSIZE>
     <TIMEOUT>5000</TIMEOUT>
   </DataBase>
   <XMLContent id="DIB310" dir="INSERT_CONTENT_DIR" />
   <Images id="DIB310" dir="INSERT_IMAGE_DIR" />
   <Cache>
     <Enabled>true</Enabled>
     <SecondsTillStale>0</SecondsTillStale>
     <LastUpdateInterval>1800</LastUpdateInterval>
   </Cache>
</MediSpan>

Hmm. There is a pool size setting there, but we can't tell if it is being used.

Is that Tomcat 9.0.80 as provided by the ASF?

An explicit answer to this question would be helpful.

In terms of the way forward, we need to see the thread dumps taken while the problem is happening to figure out where the blockage is occurring and (hopefully) why.

Mark
