[ https://issues.apache.org/jira/browse/COUCHDB-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508271#comment-13508271 ]
Adam Kocoloski commented on COUCHDB-1346:
-----------------------------------------

I did some more reading and log-based debugging this morning. We're doing something rather screwy. It turns out that the individual view index processes are *already* monitoring their parent DB; when we delete the DB, the index is automatically closed:

https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L80
https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L296-L300

The handler for that 'DOWN' message will invoke the gen_server's terminate function, which closes the file descriptor cleanly. So that's A Good Thing.

Unfortunately, we don't always let that cleanup run to completion, because we've got this separate DB update notification listener:

https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index_server.erl#L159-L168

The couch_index processes do not trap exits, so when the reset_indexes function calls shutdown_sync it terminates the couch_index process immediately, bypassing any additional cleanup that we wanted to do. The patch I wrote allows for the termination to finish cleanly.

It seems to me that even with the existing shutdown_sync invocation we'd eventually close all the file descriptors because of exit signal propagation, but at that point we may be racing the process that tries to delete them (and when we lose, we hang). The clean shutdown avoids that race.

> CouchDB hangs during start of view indexing
> -------------------------------------------
>
> Key: COUCHDB-1346
> URL: https://issues.apache.org/jira/browse/COUCHDB-1346
> Project: CouchDB
> Issue Type: Bug
> Components: View Server Support
> Affects Versions: 1.3
> Environment: Windows 7 Enterprise only, not able to replicate on Mac OS X.
> Erlang R14B03 + crypto patches.
> Mozilla Javascript 1.8.5
> Reporter: Dave Cottlehuber
> Assignee: Adam Kocoloski
> Priority: Blocker
> Labels: Windows
> Fix For: 1.3
>
>
> [info] [<0.20499.0>] Opening index for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc sig: "d5c87ad33242b181f86be2139cbccd96"
> [info] [<0.20504.0>] Starting index update for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc
> [info] [<0.20334.0>] 172.16.40.1 - - POST /test_suite_db/_temp_view 500
> [info] [<0.20513.0>] 172.16.40.1 - - GET /_utils/couch_tests.html?script/couch_tests.js 200
> [info] [<0.20514.0>] 172.16.40.1 - - GET /_utils/index.html 200
> [info] [<0.20060.0>] 172.16.40.1 - - DELETE /test_suite_db_a/ 200
> [info] [<0.20407.0>] 172.16.40.1 - - GET /test_suite_reports/ 404
> [info] [<0.20058.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20071.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20069.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20484.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20364.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20062.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20388.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20345.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20072.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20059.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20061.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20472.0>] 172.16.40.1 - - DELETE /test_suite_db/ 200
> [error] [<0.20050.0>] ** Generic server couch_index_server terminating
> ** Last message in was {'$gen_cast',{reset_indexes,<<"test_suite_db">>}}
> ** When Server state == {st,"../var/lib/couchdb"}
> ** Reason for termination ==
> ** {{case_clause,{error,eacces}},
> [{couch_file,'-nuke_dir/2-fun-0-',3},
> {lists,foreach,2},
> {couch_file,nuke_dir,2},
> {couch_index_server,handle_cast,2},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}
> =ERROR REPORT==== 23-Nov-2011::21:17:14 ===
> ** Generic server couch_index_server terminating
> ** Last message in was {'$gen_cast',{reset_indexes,<<"test_suite_db">>}}
> ** When Server state == {st,"../var/lib/couchdb"}
> ** Reason for termination ==
> ** {{case_clause,{error,eacces}},
> [{couch_file,'-nuke_dir/2-fun-0-',3},
> {lists,foreach,2},
> {couch_file,nuke_dir,2},
> {couch_index_server,handle_cast,2},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}
> [error] [<0.20050.0>] {error_report,<0.19957.0>,
> {<0.20050.0>,crash_report,
> [[{initial_call,
> {couch_index_server,init,['Argument__1']}},
> {pid,<0.20050.0>},
> {registered_name,couch_index_server},
> {error_info,
> {exit,
> {{case_clause,{error,eacces}},
> [{couch_file,'-nuke_dir/2-fun-0-',3},
> {lists,foreach,2},
> {couch_file,nuke_dir,2},
> {couch_index_server,handle_cast,2},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]},
> [{gen_server,terminate,6},
> {proc_lib,init_p_do_apply,3}]}},
> {ancestors,
> [couch_secondary_services,couch_server_sup,
> <0.19958.0>]},
> {messages,
> [{'$gen_cast',
> {reset_indexes,<<"test_suite_db_a">>}}]},
> {links,[<0.20051.0>,<0.20026.0>]},
> {dictionary,[]},
> {trap_exit,true},
> {status,running},
> {heap_size,1597},
> {stack_size,24},
> {reductions,12211}],
> [{neighbour,
> [{pid,<0.20051.0>},
> {registered_name,[]},
> {initial_call,
> {couch_event_sup,init,['Argument__1']}},
> {current_function,{gen_server,loop,6}},
> {ancestors,
> [couch_index_server,
> couch_secondary_services,
> couch_server_sup,<0.19958.0>]},
> {messages,[]},
> {links,[<0.20050.0>,<0.20018.0>]},
> {dictionary,[]},
> {trap_exit,false},
> {status,waiting},
> {heap_size,233},
> {stack_size,9},
> {reductions,32}]}]]}}
> =CRASH REPORT==== 23-Nov-2011::21:17:14 ===
> crasher:
> initial call: couch_index_server:init/1
> pid: <0.20050.0>
> registered_name: couch_index_server
> exception exit: {{case_clause,{error,eacces}},
> [{couch_file,'-nuke_dir/2-fun-0-',3},
> {lists,foreach,2},
> {couch_file,nuke_dir,2},
> {couch_index_server,handle_cast,2},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}
> in function gen_server:terminate/6
> ancestors: [couch_secondary_services,couch_server_sup,<0.19958.0>]
> messages: [{'$gen_cast',{reset_indexes,<<"test_suite_db_a">>}}]
> links: [<0.20051.0>,<0.20026.0>]
> dictionary: []
> trap_exit: true
> status: running
> heap_size: 1597
> stack_size: 24
> reductions: 12211
> neighbours:
> neighbour: [{pid,<0.20051.0>},
> {registered_name,[]},
> {initial_call,{couch_event_sup,init,['Argument__1']}},
> {current_function,{gen_server,loop,6}},
> {ancestors,[couch_index_server,couch_secondary_services,
> couch_server_sup,<0.19958.0>]},
> {messages,[]},
> {links,[<0.20050.0>,<0.20018.0>]},
> {dictionary,[]},
> {trap_exit,false},
> {status,waiting},
> {heap_size,233},
> {stack_size,9},
> {reductions,32}]
> [error] [<0.20026.0>] {error_report,<0.19957.0>,
> {<0.20026.0>,supervisor_report,
> [{supervisor,{local,couch_secondary_services}},
> {errorContext,child_terminated},
> {reason,
> {{case_clause,{error,eacces}},
> [{couch_file,'-nuke_dir/2-fun-0-',3},
> {lists,foreach,2},
> {couch_file,nuke_dir,2},
> {couch_index_server,handle_cast,2},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}},
> {offender,
> [{pid,<0.20050.0>},
> {name,index_server},
> {mfargs,{couch_index_server,start_link,[]}},
> {restart_type,permanent},
> {shutdown,brutal_kill},
> {child_type,worker}]}]}}
>
> OS process tree at this time is:
> Process information for SENDAI:
> Name                Pid Pri Thd  Hnd     VM     WS   Priv
> Idle                  0   0   2    0      0     24      0
> System                4   8  79  477   3380    304    108
> explorer           1984   8  21  664 213732  46340  21540
> cmd                2104   8   1   25  48132   3304   2144
> pslist             2776  13   1  133  63584   4976   2000
> cmd                2504   8   1   26  44980   3512   3012
> werl               2680   8  16  390 196232  40064  28628
> win32sysinfo       1152   8   1   21  12624   2124    640
> couchspawnkillable 1444   8   1   30  12992   2284    688
> couchjs            1468   8   1   39  55900   6572   4056
> couchspawnkillable 2740   8   1   30  12992   2280    684
> couchjs            2756   8   1   39  55900   7108   4444
>
> Erlang resumes running CouchDB when couchjs procs are terminated with extreme prejudice. The hang still occurs after reverting fdmanana's COUCHDB-1334 commit. This could be a race condition during invalidation of the views, and subsequent deletion of the related ddoc view directory prior to reindexing. On Windows a filesystem object cannot be deleted if there are open handles remaining.
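
As an illustration of the ordering problem discussed in this thread (index files have to be closed before their directory is deleted, because Windows refuses to delete files that still have open handles), here is a minimal Python sketch. It is not CouchDB code: IndexWorker, reset_index_dir and demo are hypothetical names, and a thread stands in for a couch_index process. The only point is that the deleter waits for each worker to finish its own cleanup before it removes the directory.

```python
import os
import shutil
import tempfile
import threading


class IndexWorker(threading.Thread):
    """Hypothetical stand-in for a process that keeps an index file open."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        self.file_open = threading.Event()       # set once the handle exists
        self.stop_requested = threading.Event()  # set to ask for a clean stop
        self.closed_cleanly = False

    def run(self):
        with open(self.path, "wb") as fd:
            fd.write(b"index data")
            self.file_open.set()
            # Hold the handle until someone asks us to stop.
            self.stop_requested.wait()
        # Leaving the with-block has closed the handle.
        self.closed_cleanly = True

    def shutdown(self):
        """Request a stop and block until the worker's cleanup has finished."""
        self.stop_requested.set()
        self.join()


def reset_index_dir(index_dir, workers):
    """Delete index_dir only after every worker has closed its file."""
    for worker in workers:
        worker.shutdown()
    shutil.rmtree(index_dir)


def demo():
    index_dir = tempfile.mkdtemp()
    worker = IndexWorker(os.path.join(index_dir, "example.view"))
    worker.start()
    worker.file_open.wait()
    reset_index_dir(index_dir, [worker])
    return worker.closed_cleanly, os.path.exists(index_dir)
```

Calling demo() returns (True, False): the worker closed its file, and the directory is gone. If reset_index_dir called shutil.rmtree without first waiting in shutdown(), the delete would race the worker's close, which is the situation that fails with an access error on Windows.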