I just wanted to confirm that the problem was apparently solved after upgrading web2py to version 2.16.1 and upgrading Redis to the latest stable 64-bit version (1.8.1). The app has been running for a couple of weeks now and the problem hasn't come back. I'm truly sorry for having bothered you with this, considering that the problem could be solved just by upgrading web2py and Redis. We have also just moved our production deployment to CentOS 7 for better stability.
Once again, I deeply thank all of you for your time and your constant willingness to help!
Best regards,
Lisandro.

On Sunday, May 6, 2018 at 0:13:06 (UTC-3), Massimo Di Pierro wrote:
>
> That is a very old version of web2py. The problem may have been solved years ago. ;-)
>
> On Thursday, 3 May 2018 16:37:53 UTC-5, Lisandro wrote:
>>
>> I came back to reply to this particular message, because the problem happened again yesterday.
>>
>> Just as a reminder, the problem was with redis_cache.py failing to write one specific key name. No *__lock object was found in the cache during the incident. Connecting manually with redis-cli showed that the key wasn't present and could be written and read. But from the web2py application (using the redis_cache.py adapter), the application hangs on that specific key. The only solution is to change the key name. Weird.
>>
>> Yesterday, our Redis server crashed (it reported out of memory, but the server had plenty of RAM available; this was due to an issue with memory fragmentation). Anyway, after that crash, I did a FLUSHALL and restarted the Redis server. Everything started to work ok for almost all of the 260 websites we run with web2py, but 4 of them kept hanging. All the requests to those websites were hanging. I added some log statements and found out that it was the same problem, this time with another random key name.
>>
>> The solution was, again, to decompile the apps, manually edit the source code and change the key name. But of course, this is not a solution. Again, this time there was no *__lock key in the cache, and I could successfully use redis-cli to write and read the key name.
>>
>> I don't have a proper solution yet. redis-cli FLUSHALL doesn't help.
>>
>> I'm using an old web2py version (Version 2.10.3-stable+timestamp.2015.04.02.21.42.07). I already have plans to upgrade. But before that, could I manually update only the gluon/contrib/redis_*.py files? In your opinion, would that test be worthwhile? Does it make any sense?
>>
>> On Friday, April 20, 2018 at 8:28:28 (UTC-3), Lisandro wrote:
>>>
>>> Sorry to bother you again with this, but I think I've found the problem. *The problem is apparently with the Redis integration.* It had nothing to do with connections, the database or sessions, none of that. Here is what I've found.
>>>
>>> Remember, the line where my app hangs is this:
>>>
>>> session.important_messages = cache.redis('important-messages-%s' % auth.user.id,
>>>                                          lambda: get_important_messages(),
>>>                                          time_expire=180)
>>>
>>> As the problem only showed up in production, on the website of one of my customers, I asked him to allow me to play a little with the code. So, the first thing I did was to cache request.now instead of calling the function get_important_messages(), but the problem remained. Then I thought "maybe if I change the key..." and changed the original code to this:
>>>
>>> session.important_messages = cache.redis('important-messages',
>>>                                          lambda: get_important_messages(),
>>>                                          time_expire=180)
>>>
>>> *Notice that the only thing I changed was the key stored in Redis. And it worked!* I thought that maybe auth.user.id was some large number, but I checked and the user ID is 3. I tried passing it as int(auth.user.id), but with no success. *The app still hangs when I try to retrieve that specific key.* Only that key.
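(A minimal diagnostic sketch, as an aside: the snippet below uses the redis-py client directly, outside of web2py, to inspect the problem key and to look for leftover lock keys. The "w2p:myapp:" prefix and the "*__lock" suffix are assumptions taken from the key names quoted in this thread and may differ in another setup.)

    # Hedged diagnostic sketch: inspect the problem key and any lingering
    # lock keys directly with redis-py, bypassing gluon/contrib/redis_cache.py.
    # Key prefix and lock suffix are assumptions based on this thread.
    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    key = 'w2p:myapp:important-messages-3'
    print('exists: %s' % r.exists(key))   # 1 if the value key is present
    print('ttl: %s' % r.ttl(key))         # -2 = missing, -1 = present with no expiration

    # A worker that died mid-write could leave a lock key behind; list any.
    for lock in r.scan_iter(match='w2p:myapp:*__lock'):
        print('lock key: %s, ttl: %s' % (lock, r.ttl(lock)))

(If a stale lock key did show up here, deleting it from redis-cli would be a reasonable next test; in this thread no lock key was ever found, which is what pointed away from the locking logic.)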
>>>
>>> I've connected with redis-cli and it tells me that the key isn't there. So I set a "hello" value for the key, read it back, and then deleted it:
>>>
>>> $ redis-cli
>>> 127.0.0.1:6379> DUMP w2p:myapp:important-messages-3
>>> (nil)
>>> 127.0.0.1:6379> SET w2p:myapp:important-messages-3 "hello"
>>> OK
>>> 127.0.0.1:6379> DUMP w2p:myapp:important-messages-3
>>> "\x00\x05hello\x06\x00\xf5\x9f\xb7\xf6\x90a\x1c\x99"
>>> 127.0.0.1:6379> DEL w2p:myapp:important-messages-3
>>> (integer) 1
>>> 127.0.0.1:6379> DUMP w2p:myapp:important-messages-3
>>> (nil)
>>>
>>> But even after that, web2py hangs with this simple code:
>>>
>>> r = cache.redis('important-messages-3', lambda: request.now, time_expire=30)
>>>
>>> This happens only with that specific key. I can set the key to "important-messages-2", "important-messages-999", "important-messages-A", anything I can think of, but with that specific key it hangs.
>>>
>>> We have several websites (around 200), and this problem has happened about 5 or 6 times on different websites, but it was always the same problem. The only solution I had (until now) was to create a new account for the user (that explains why it worked with a new account: the new account had a different auth.user.id, so the key stored in Redis was different).
>>>
>>> Could this be a bug in the redis_cache.py integration? Maybe I should open a new thread about this, right?
>>>
>>> On Thursday, April 19, 2018 at 10:27:46 (UTC-3), Lisandro wrote:
>>>>
>>>> Hi there,
>>>> I've found the issue, but I still don't know how it is produced. Anthony was right from the beginning when he said "the app is not hanging because the locks are being held, but rather the locks are being held because the app is hanging". Since that comment, I was waiting for the problem to happen again so I could decompile the app and print some logs to see exactly the line of code where the application hangs.
>>>>
>>>> So that's what I did, and I've found that my app indeed hangs on a specific line of code in models/db.py. This is my models/db.py, summarized:
>>>>
>>>> if auth.is_logged_in() and auth.user.responsable:
>>>>
>>>>     # ----------- THIS IS THE LINE WHERE THE CODE HANGS ----------
>>>>     session.important_messages = cache.redis('important_messages-%s' % auth.user.id,
>>>>                                              lambda: get_important_messages(),
>>>>                                              time_expire=180)
>>>>
>>>> So I checked what the function get_important_messages() does, and I saw that it connects to a webservice (also developed with web2py):
>>>>
>>>> def get_important_messages():
>>>>     from gluon.contrib.simplejsonrpc import ServerProxy
>>>>     webservice = ServerProxy('https://main-app-domain.com/ws/call/jsonrpc?token=XXX1')
>>>>     try:
>>>>         result = webservice.get_account_info(CONFIG.customer_id)
>>>>     except Exception as e:
>>>>         result = []
>>>>     return result
>>>>
>>>> Then I went to double-check my nginx error.log, this time looking for errors on the URL that the app uses to connect to the webservice.
>>>> Surprisingly, I'm seeing a few timeouts every day for that URL:
>>>>
>>>> 2018/04/17 15:08:22 [error] 23587#23587: *93711423 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX1 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:08:22 [error] 23587#23587: *93711449 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX2 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:08:36 [error] 23582#23582: *93711928 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX1 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:09:04 [error] 23582#23582: *93713029 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX3 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:09:16 [error] 23591#23591: *93713451 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX1 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:09:24 [error] 23582#23582: *93713819 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX4 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:09:25 [error] 23582#23582: *93713839 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX5 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:10:25 [error] 23582#23582: *93716003 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX1 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:12:34 [error] 23591#23591: *93720887 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX6 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
>>>> 2018/04/17 15:12:36 [error] 23590#23590: *93720938 upstream timed out (110: Connection timed out) while reading response header from upstream, client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST /ws/call/jsonrpc?token=XXX7 HTTP/1.1", upstream: "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com"
"uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/17 15:12:50 [error] 23589#23589: *93721468 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX8 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 10:39:39 [error] 16600#16600: *89723537 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX7 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 10:40:10 [error] 16601#16601: *89724987 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX9 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 10:40:11 [error] 16602#16602: *89725040 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX9 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 16:59:46 [error] 17874#17874: *90771814 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX8 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 17:00:56 [error] 17877#17877: *90774663 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX8 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/16 17:01:11 [error] 17879#17879: *90775407 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX9 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> 2018/04/15 13:46:46 [error] 11395#11395: *86829630 upstream timed out >>>> (110: Connection timed out) while reading response header from upstream, >>>> client: MY.OWN.SERVER.IP, server: main-app-domain.com, request: "POST >>>> /ws/call/jsonrpc?token=XXX9 HTTP/1.1", upstream: >>>> "uwsgi://unix:///tmp/medios.socket", host: "main-app-domain.com" >>>> >>>> >>>> So, what I know now is that *the problem are these timeouts that occur >>>> ocasionally when an app tries to connect the main webservice with this >>>> code:* >>>> >>>> webservice = ServerProxy(' >>>> https://main-app-domain.com/ws/call/jsonrpc?token=XXX1' >>>> >>>> >>>> >>>> This is the code of the ws.py controller that implements the webservice: >>>> >>>> # -*- coding: utf-8 -*- >>>> >>>> from gluon.tools import Service >>>> >>>> service = Service() >>>> >>>> >>>> def call(): >>>> if not request.vars.token or not db(db.websites.token == >>>> request.vars.token).count(): >>>> raise HTTP(403) >>>> session.forget() >>>> return service() >>>> >>>> >>>> 
>>>>
>>>> Notice that the call receives a token, and every app that tries to connect has its own token, in order to validate the connection. I'm not sure why some of the calls to the webservice hang, but I am sure of this:
>>>>
>>>> - While some of these calls time out, other identical calls work properly (and they are all identical, just calls to connect to the webservice).
>>>> - Just in case, I've checked that my nginx configuration isn't applying request limits to my server IP or anything like that; no warning or error about this shows up in the nginx error.log.
>>>> - Also, just in case, I checked my pgBouncer log to see if connections to the main database were exhausted, but that's not the case either (actually, if that were the case, I would see error tickets created, and every other attempt to connect to the webservice would fail, which is not happening).
>>>>
>>>> Now I'm lost here; I don't see how the attempt to connect to the webservice could fail. Maybe network problems, but those should affect other connections as well.
>>>>
>>>> Any comment or suggestion will be much appreciated.
>>>> Regards,
>>>> Lisandro.
>>>>
>>>> On Monday, April 16, 2018 at 18:57:47 (UTC-3), Lisandro wrote:
>>>>>
>>>>> Hi, thank you both for your time and concern.
>>>>>
>>>>> @Richard: this particular website was still running with sessions stored in the database. As we have several websites, moving sessions to Redis is something that we will do progressively over the next weeks.
>>>>>
>>>>> @Anthony: the database server is PostgreSQL, running on the same VPS, so I wouldn't say it's due to network problems. I do have pgBouncer, and I limit the pool size to only 1 connection (plus 2 in the reserve pool) per database. The app didn't have much load (actually it almost never does), but in this situation, with that query hanging for 60 seconds, it's probable that the error ticket appeared because there were no more connections available for that db (for example, if the user with the problem tried simultaneously on a laptop, a PC and his mobile phone).
>>>>>
>>>>> Some (weird) points about the problem:
>>>>>
>>>>> - While it shows up for one specific account, other user accounts can log in and work perfectly with the app.
>>>>> - As an admin, I have permission to impersonate other user accounts. When the problem happens, I can impersonate any account except the one with the problem (the impersonation is successful, but the same timeout appears once I'm impersonating that account).
>>>>> - The problem doesn't go away after deleting all web2py_session_table records and clearing cookies.
>>>>> - The problem doesn't go away after changing the account email or password.
>>>>> - The only solution I've applied the last few times it happened was to create a new account for the user and invalidate the old one.
>>>>>
>>>>> Today, when the problem happened, I created the new account for the user and moved the sessions to Redis. Maybe I should have kept sessions in the db, in order to debug the problem with that account. That's not possible anymore, because I already moved to Redis. Of course I could move sessions back to the db, but I don't like the idea of debugging in production on a customer's website, especially one who had a recent issue with this.
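(Another hedged debugging sketch, related to the "query hanging for 60 seconds" and the pool of 1 connection per database mentioned above: from a web2py shell, db.executesql() can ask PostgreSQL what each of the app's connections is doing while a request is stuck. The pg_stat_activity columns used below exist on PostgreSQL 9.2 and later; wait-related columns vary by version, so this is only a starting point.)

    # Hedged sketch: list this database's connections, how long their current
    # queries have been running, and the query text, to tell a query that is
    # genuinely stuck in PostgreSQL apart from a session that is merely
    # "idle in transaction" while the web2py request waits on something else.
    rows = db.executesql(
        "SELECT pid, state, now() - query_start AS running_for, query "
        "FROM pg_stat_activity "
        "WHERE datname = current_database() "
        "ORDER BY query_start;"
    )
    for pid, state, running_for, query in rows:
        print('%s | %s | %s | %s' % (pid, state, running_for, (query or '')[:80]))

(A request blocked on the external webservice would typically show up here as "idle in transaction" rather than as a long-running query, which is consistent with Anthony's point further down that the locks are a symptom of the hang, not its cause.)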
>>>>>
>>>>> So, I'll wait to see if it happens again, and I'll try to leave the account there to do some tests.
>>>>> Thank you very much for your time!
>>>>>
>>>>> On Monday, April 16, 2018 at 17:31:47 (UTC-3), Anthony wrote:
>>>>>>
>>>>>> Where is the database server running? Is it possible there are occasional network problems connecting to it?
>>>>>>
>>>>>> Anthony
>>>>>>
>>>>>> On Monday, April 16, 2018 at 3:15:54 PM UTC-4, Lisandro wrote:
>>>>>>>
>>>>>>> Hi there, sorry to bother you again; I have some additional info that could help.
>>>>>>>
>>>>>>> The problem happened again, exactly the same as the other times. But this time an error ticket was created with this traceback:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/var/www/medios/gluon/main.py", line 463, in wsgibase
>>>>>>>     session._try_store_in_db(request, response)
>>>>>>>   File "/var/www/medios/gluon/globals.py", line 1152, in _try_store_in_db
>>>>>>>     if not table._db(table.id == record_id).update(**dd):
>>>>>>>   File "/var/www/medios/gluon/packages/dal/pydal/objects.py", line 2117, in update
>>>>>>>     ret = db._adapter.update("%s" % table._tablename,self.query,fields)
>>>>>>>   File "/var/www/medios/gluon/packages/dal/pydal/adapters/base.py", line 988, in update
>>>>>>>     raise e
>>>>>>> DatabaseError: query_wait_timeout
>>>>>>> server closed the connection unexpectedly
>>>>>>>     This probably means the server terminated abnormally
>>>>>>>     before or while processing the request.
>>>>>>>
>>>>>>> Could this indicate that for some reason web2py is failing to store the session? Or could it still be that a deadlock in my app code is producing this error?
>>>>>>>
>>>>>>> On Friday, April 6, 2018 at 18:59:28 (UTC-3), Lisandro wrote:
>>>>>>>>
>>>>>>>> Oh, I see, you made a good point there; I hadn't realised that.
>>>>>>>>
>>>>>>>> I guess I will have to take a closer look at my app's code. Considering that the problem exists for specific accounts while others work ok, and considering also that the problem happens with any request that that specific user makes to any controller/function, I'm thinking: what does my app do differently for one user compared to another at request level? By "request level" I mean all the code the app runs on every request, starting with models/db.py.
>>>>>>>>
>>>>>>>> I'll take a closer look at that and will post another message here if I find something that could point to the root cause of the issue.
>>>>>>>>
>>>>>>>> Thank you very much for your help!
>>>>>>>>
>>>>>>>> On Friday, April 6, 2018 at 16:05:13 (UTC-3), Anthony wrote:
>>>>>>>>>
>>>>>>>>> On Friday, April 6, 2018 at 10:58:56 AM UTC-4, Lisandro wrote:
>>>>>>>>>>
>>>>>>>>>> Yes, in fact, I've been running that SQL command to check for locks, and sometimes I see that lock on other tables, but those other locks live for less than a second. However, when the problem happens, the locks on the auth_user and web2py_session tables remain there for the whole 60 seconds.
>>>>>>>>>>
>>>>>>>>> Yes, but that doesn't mean the lock or the database has anything to do with the app hanging.
>>>>>>>>> The locks will be held for the duration of the database transaction, and web2py wraps HTTP requests in a transaction, so the transaction doesn't end until the request ends (unless you explicitly call db.commit()). In other words, the app is not hanging because the locks are being held, but rather the locks are being held because the app is hanging. First you have to figure out why the app is hanging (it could be the database, but it could be something else).
>>>>>>>>>
>>>>>>>>> Anthony
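(A last hedged sketch of the point above: if a model or controller has to call something slow after writing to the database, explicitly ending the transaction first keeps row locks on auth_user and the web2py_session table from being held for the whole wait. The names follow the models/db.py excerpt quoted earlier; whether committing early is acceptable depends on the app's consistency requirements, so this is only an illustration of the mechanism Anthony describes.)

    # Hedged sketch: commit before the slow external call so locks taken
    # earlier in this request are released; web2py still commits (or rolls
    # back) again at the end of the request as usual.
    if auth.is_logged_in() and auth.user.responsable:
        db.commit()   # ends the current transaction and releases its locks
        session.important_messages = cache.redis(
            'important-messages-%s' % auth.user.id,
            lambda: get_important_messages(),   # slow JSON-RPC round trip
            time_expire=180)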