On Tue, Jan 13, 2015 at 8:39 PM, Christopher Schultz <ch...@christopherschultz.net> wrote:

>
> Darren,
>
> (Sorry... just had to remove that monstrous stack trace...)
>
> On 1/13/15 5:04 PM, Darren Davis wrote:
> > Hi Christopher.  Yes, we've run SHOW PROCESSLIST and can find no
> > evidence of the validation query running on MySQL.
>
> Strange. Maybe you are waiting for the db server's buffer to flush or
> something like that.
>
>
I think this is because the client still believes it has an open
connection: netstat on the client shows an established connection on port
3306 for at least a few minutes after the load balancer kills it, while
the server side drops out of netstat immediately.
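
For what it's worth, the roughly 15-minute hang looks like the TCP
retransmission timeout: the client keeps retransmitting into a black hole
until the stack gives up. A minimal sketch of bounding that wait with
SO_TIMEOUT (host, port, and timeout values are made up for illustration):

    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class ReadTimeoutDemo {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket();
            // Bound the connect attempt as well as subsequent reads.
            socket.connect(new InetSocketAddress("db.example.com", 3306), 10000);
            socket.setSoTimeout(30000); // reads fail after 30s instead of blocking
            try (InputStream in = socket.getInputStream()) {
                // On a half-open connection this read would otherwise block
                // until TCP retransmission gives up (often ~15 minutes).
                int b = in.read();
                System.out.println("read: " + b);
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out quickly: " + e.getMessage());
            } finally {
                socket.close();
            }
        }
    }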


> > We also just tried an experiment outside of Tomcat completely, but
> > connecting to a downed web server host and manually opening up a
> > mysql client connection to the data server and executing a single
> > command.
> >
> > We left that client window idle for an hour and 5 minutes, and
> > attempted to execute a simple select count(*) command against a
> > tiny table.  The client attempted to execute the query, and a
> > NetStat on that box showed an open connection between the two
> > servers using port 3306.  We also checked the process list during
> > this time and could not find any queries at all from the server in
> > question.
> >
> > At about the 15-minute wait mark, the client finally came back with
> > this message: "ERROR 2013 (HY000): Lost connection to MySQL server
> > during query."
>
> Was this with the MySQL command-line client? What query did you issue
> ("SELECT 1")?
>
>
Yes, it was just the command-line client, and we issued a SELECT COUNT(*)
against a table with a couple of rows in it.


> > Attempting to execute the command a 2nd time (using the up
> > arrow) re-established the connection, and it ran perfectly in a few
> > milliseconds.
>
> That's interesting. I've never experienced anything like that with
> MySQL, but we use a VLAN between our application and database servers
> with no hardware firewall, so we don't have any connection timeout
> problems. Also, when connections are dropped due to inactivity, they
> re-connect without any problems.
>
> > I checked the mysql configuration and it is set to the default
> > values for keeping connections/interactive connections open (for 8
> > hours), so it seems that maybe the Cisco firewall between the two
> > servers is terminating connections out from under us, but in a way
> > that the OS cannot detect.
>
> What if you set that idle connection timeout to something like 5
> minutes? Can you reproduce this issue more quickly? Can you look at
> the firewall configuration to see if you can change the idle timeout /down/
> to something more testable?
>
As part of our move to the new versions of Tomcat/Java, we are in a new
cloud environment with a different type of firewall.  The provider
confirmed to us late today that it is configured to kill "idle" TCP
connections after an hour, which is something our old firewall didn't do.

Because we sometimes have low traffic at this time of year, especially on
the weekends, we think what is happening is that one or more of the
minimum 10 pool connections goes unused for more than an hour.  Since we
didn't have idle-connection testing turned on, those connections were
being killed by the firewall out from under the pool, and depending on
how soon they were used after that, we would run into the 15-minute delay
before they were deemed lost and replaced with new connections.
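
For anyone following along, the fix amounts to validating or evicting idle
connections well inside the firewall's one-hour window. A sketch of the
equivalent programmatic setup with Tomcat's jdbc-pool (assuming a recent
tomcat-jdbc version; the URL and exact values are illustrative, not our
production settings):

    import org.apache.tomcat.jdbc.pool.DataSource;
    import org.apache.tomcat.jdbc.pool.PoolProperties;

    public class PoolSetup {
        public static DataSource createPool() {
            PoolProperties p = new PoolProperties();
            p.setUrl("jdbc:mysql://db.example.com:3306/app"); // hypothetical
            p.setDriverClassName("com.mysql.jdbc.Driver");
            p.setMinIdle(10);                 // the connections that sit idle on weekends
            p.setTestWhileIdle(true);         // validate idle connections in the background
            p.setValidationQuery("SELECT 1");
            p.setValidationQueryTimeout(5);   // seconds; fail fast instead of hanging
            p.setLogValidationErrors(true);
            p.setTimeBetweenEvictionRunsMillis(30000);       // run the idle tester every 30s
            p.setMinEvictableIdleTimeMillis(15 * 60 * 1000); // retire connections idle >15 min,
                                                             // well under the 1-hour cutoff
            DataSource ds = new DataSource();
            ds.setPoolProperties(p);
            return ds;
        }
    }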


> > I've also fired up the YourKit profiler on this box and am seeing
> > other threads which have had to wait in the same
> > SocketInputStream.read code, all three started a few seconds apart.
> > It just wasn't detected as a deadlock, because it took place
> > outside of any synchronized methods.
>
> What makes you think it's deadlock? Deadlock is a very specific thing.
> Just because many threads are waiting in SocketInputStream.read
> doesn't mean there are any threading issues at all. I suspect that
> each SocketInputStream is distinct and only in use by a single thread.
> The threads are blocked on I/O, right? So they aren't waiting on a
> monitor. The best you could do would be to find the native file
> descriptor for each socket and determine that they are different from
> each other. I would be very surprised if they are the same, used
> across threads. If you *are* using Connection objects across threads,
> you should be very careful. Connection objects ought to be threadsafe
> (I think) but use of Statement and ResultSet objects across threads is
> a terrible idea.
>
We have a couple of synchronized methods in two of our services which hold
locks in order to update a centralized record.  We realize this is a bad
design and are already working on refactoring this code to remove the need
for it.  We've had a few instances where the 15-minute wait happened
inside the synchronized block of code, meaning that several other threads
also had to wait for the 15 minutes to end before they got their turn
inside that block.  Our profiler (YourKit) detects these blocked threads
as deadlocks and colors them red in the thread graph.
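
To make the failure mode concrete, here is a stripped-down illustration
(class, method, and SQL are invented for the example) of why one hung read
stalls the other threads without being a true deadlock:

    public class CentralRecordService {
        // One slow read inside this lock stalls every caller for its duration.
        public synchronized void updateCentralRecord() throws java.sql.SQLException {
            try (java.sql.Connection c = getConnection();
                 java.sql.Statement s = c.createStatement()) {
                // On a half-open connection this call can block ~15 minutes in
                // SocketInputStream.read.  Every other thread calling
                // updateCentralRecord() queues on the monitor meanwhile --
                // blocked, but not deadlocked, since the lock is released
                // once the read finally fails.
                s.executeUpdate("UPDATE central_record SET counter = counter + 1");
            }
        }

        private java.sql.Connection getConnection() {
            throw new UnsupportedOperationException("stand-in for the pool lookup");
        }
    }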


> > It seems that sometime around the hour mark, connections get
> > dropped, so we're thinking that either adding idle checking or
> > dropping old connections may help us avoid this.  We are a little
> > concerned, though, about the various alleged Connector/J socket-read
> > issues as a possible cause.
>
> I don't think you should blame Connector/J at this point. They may
> have ClassLoader pinning issues (don't get me started), but the driver
> is fairly robust and mature.
>
> > We're running an older 5.1.18 version of the Connector/J driver,
> > but aren't sure whether moving to the latest 5.1.34 release would
> > change anything.
>
> We are also still using 5.1.18 and have never had any of these kinds
> of issues. I would highly suspect the network environment. See what
> you can find out by tinkering with the firewall and db idle policies.
> You may find that the pipe across the network gets into a state where
> the client is sure the connection is still valid, but it's simply
> never going to return any data. In that case, you'll need to figure
> out how to have that connection fail faster.
>
> Do you have a read-timeout set on your driver?
>
We have reconfigured the connection pool by setting maxAge to 360000,
testWhileIdle to true, validationQueryTimeout to 5, and
logValidationErrors to true.  We restarted the Tomcat service 5 1/2 hours
ago, and so far we haven't seen this error return.  We are also hopeful
that even if it does, the validationQueryTimeout change would prevent it
from holding things up for 15 minutes.
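
On the read-timeout question: if we add one, Connector/J exposes it as the
socketTimeout connection property (milliseconds) on the JDBC URL,
alongside connectTimeout.  A sketch with illustrative values:

    // connectTimeout / socketTimeout are Connector/J connection properties (ms).
    // Host, database, credentials, and values here are illustrative only.
    String url = "jdbc:mysql://db.example.com:3306/app"
               + "?connectTimeout=10000"  // give up on connect after 10s
               + "&socketTimeout=30000";  // give up on any blocked read after 30s
    java.sql.Connection conn =
            java.sql.DriverManager.getConnection(url, "user", "secret");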

We really appreciate all of the feedback from you and others today, since
we've focused most of our dev team on trying to understand and troubleshoot
this issue for the past few days.



> -chris
