Hi Stefan,

we had a discussion on Slack [1] (found by Jan at [2]) about the atomicity of "rename". Could it be a similar problem here (Linux/qemu/fs stack)?
In [2], they could work around their problem by waiting some time after renaming.
@Stefan, maybe you could try waiting some time after renaming/closing the db?

Cheers,
-Ronny

[1] https://couchdb.slack.com/archives/C01TBE2J197/p1678355980122119
[2] https://toot.cat/@zkat/109973167110793372

> On 13.03.2023 at 09:50, Stefan Kral <stefan.k...@emlix.com> wrote:
>
> Hi Jan,
>
> here you go: https://github.com/emlix/couchdb-yocto
>
> the mentioned patch is here
> https://github.com/emlix/couchdb-yocto/blob/main/meta-couchdb/recipes-core/couchdb/files/0001-swap-fds.patch
>
> when you run the compaction test (see the README to get there)
> /usr/lib/test-couchdb/test-compaction.sh
>
> you will find this as the last line of the log (/var/log/couchdb/couch.log):
> [debug] [<0.173.0>] before gen_server:call
>
> Thanks,
> Stefan
>
> On 02.03.23 at 13:45, Jan Lehnardt wrote:
>> Hi Stefan,
>>
>> Thanks for the additional info. I’m happy to try a yocto build here.
>>
>> Best
>> Jan
>> --
>>
>>> On 2. Mar 2023, at 12:24, Stefan Kral <stefan.k...@emlix.com> wrote:
>>>
>>> Hi,
>>>
>>> I can give you some background context: our CouchDB instance is running
>>> on an embedded device (with a minimal attack surface, so we have no
>>> pressure to mitigate CVEs). CouchDB was chosen because of its
>>> append-only writes and power-fail-safe property (and because of the
>>> easily scriptable curl/JSON interface).
>>>
>>> Currently there is a production system running on an SMB1 share (mounted
>>> on a Linux host) which works well (at least for our use cases). SMB1 is
>>> no longer the default on the Windows remote side, and SMB2/3 has an
>>> issue with opening a renamed but not closed file descriptor. The
>>> question is whether we can solve this issue with minimal changes.
>>>
>>>> 1. How did you verify that the gen_server:call/3 call never returns?
>>>> 2. Do you get any pertinent lines (especially crashes) in your
>>>> couch.log?
>>>
>>> by adding:
>>>
>>>> + ?LOG_DEBUG("before gen_server:call", []),
>>>> ok = gen_server:call(Db#db.main_pid, {db_updated, NewDb3},
>>>> infinity),
>>>> + ?LOG_DEBUG("after gen_server:call", []),
>>>
>>> the log gives:
>>>
>>>> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.391.0>] Compaction process
>>>> spawned for db "asdf"
>>>> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.84.0>] New task status for
>>>> <0.391.0>: [{changes_done,1},
>>>>             {database,<<"asdf">>},
>>>>             {progress,100},
>>>>             {started_on,1677753384},
>>>>             {total_changes,1},
>>>>             {type,database_compaction},
>>>>             {updated_on,1677753384}]
>>>> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.366.0>] CouchDB swapping files
>>>> .../asdf.couch and .../asdf.couch.compact.
>>>> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.366.0>] before gen_server:call
>>>
>>> then nothing for a long time...
>>>
>>> refreshing the db in the Futon web GUI gives: no response
>>>
>>> and the log continues with:
>>>
>>>> [Thu, 02 Mar 2023 11:02:54 GMT] [error] [<0.144.0>] ** Generic server
>>>> couch_compaction_daemon terminating
>>>> ** Last message in was {'EXIT',<0.145.0>,
>>>>     {timeout,
>>>>      {gen_server,call,[couch_server,get_server]}}}
>>>> ** When Server state == {state,<0.145.0>}
>>>> ** Reason for termination ==
>>>> ** {compaction_loop_died,
>>>>     {timeout,{gen_server,call,[couch_server,get_server]}}}
>>>>
>>>> [Thu, 02 Mar 2023 11:02:54 GMT] [error] [<0.144.0>] {error_report,<0.31.0>,
>>>>  {<0.144.0>,crash_report,
>>>>   [[{initial_call,
>>>>      {couch_compaction_daemon,init,['Argument__1']}},
>>>>     {pid,<0.144.0>},
>>>>     {registered_name,couch_compaction_daemon},
>>>>     {error_info,
>>>>      {exit,
>>>>       {compaction_loop_died,
>>>>        {timeout,
>>>>         {gen_server,call,[couch_server,get_server]}}},
>>>>       [{gen_server,terminate,7,
>>>>         [{file,"gen_server.erl"},{line,804}]},
>>>>        {proc_lib,init_p_do_apply,3,
>>>>         [{file,"proc_lib.erl"},{line,237}]}]}},
>>> ...
>>>
>>>
>>>> 3. Can you share your environment where you get to compile 1.6.1
>>>> successfully, so we can try and reproduce this?
>>>
>>> I could prepare a Yocto setup for you to build a toolchain and packages
>>> for a qemu/docker image, if you are familiar with that build system...
>>>
>>>> 4. Could it be that your SMB implementation doesn’t allow for opening
>>>> and closing files in this quick succession (with or without a rename
>>>> in the mix)?
>>>
>>> For testing, it doesn't need to run on an SMB share; the timeout issue
>>> occurs with the given fd-swap patch on a default (Linux) setup.
>>>
>>> And a strace log does not show any underlying FS issues.
>>>
>>>
>>> Best,
>>> Stefan
>>>
>>> On 28.02.23 at 16:47, Jan Lehnardt wrote:
>>>> First off, CouchDB 1.6.1 is no longer supported by this project AND it
>>>> has a long list of CVEs[1] against it. You REALLY should be operating
>>>> on a newer version.
>>>>
>>>> Secondly, just to understand your motivation: you think closing and
>>>> opening the fds after the file:rename/2 call will make things work
>>>> for your SMB operation?
>>>>
>>>> If yes, the only thing I could spot that is substantially different is
>>>> that the NewFd position is advanced implicitly by the underlying
>>>> file:pread/3 in [2] and your SwappedFd doesn’t get the same treatment,
>>>> but I don’t know why that should block the gen_server call, as that only
>>>> does some refcounting updates[3]. While this includes stopping the
>>>> gen_server[4], I don’t see how the Pid this operates on should be any
>>>> different under your patch.
>>>>
>>>> So:
>>>>
>>>> 1. How did you verify that the gen_server:call/3 call never returns?
>>>> 2. Do you get any pertinent lines (especially crashes) in your couch.log?
>>>> 3. Can you share your environment where you get to compile 1.6.1
>>>> successfully, so we can try and reproduce this?
>>>> 4. Could it be that your SMB implementation doesn’t allow for opening and
>>>> closing files in this quick succession (with or without a rename in
>>>> the mix)?
>>>>
>>>>
>>>> [1]: https://docs.couchdb.org/en/stable/cve/index.html
>>>> [2]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db_updater.erl#L179
>>>> [3]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db.erl#L1122-L1130
>>>> [4]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_ref_counter.erl#L84
>>>>
>>>>
>>>> Best
>>>> Jan
>>>> --
>>>> Professional Support for Apache CouchDB:
>>>> https://neighbourhood.ie/couchdb-support/
>>>>
>>>> 24/7 Observation for your CouchDB Instances:
>>>> https://opservatory.app
>>>>
>>>>
>>>>> On 28. Feb 2023, at 10:19, Stefan Kral <stefan.k...@emlix.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm experimenting with a CouchDB setup on an SMB mount point. I know this
>>>>> is not supported, but I ran into a (maybe simple) problem I don't
>>>>> understand. Maybe one of you can easily give a hint (that would be
>>>>> amazing).
>>>>>
>>>>> Given the following patch (I need to close/reopen the file descriptors
>>>>> after renaming) for the function
>>>>> https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db_updater.erl#L176
>>>>>
>>>>>> 1 --- a/src/couchdb/couch_db_updater.erl
>>>>>> 2 +++ b/src/couchdb/couch_db_updater.erl
>>>>>> 3 @@ -202,8 +202,18 @@ handle_call({compact_done, CompactFilepath},
>>>>>> _From, #db{filepath=Path}=Db) ->
>>>>>> 4 RootDir = couch_config:get("couchdb", "database_dir", "."),
>>>>>> 5 couch_file:delete(RootDir, Filepath),
>>>>>> 6 ok = file:rename(CompactFilepath, Filepath),
>>>>>> 7 +
>>>>>> 8 + ok = couch_file:close(NewDb#db.updater_fd),
>>>>>> 9 + ok = couch_file:close(NewDb#db.fd),
>>>>>> 10 + {ok, SwappedFd} = couch_file:open(Filepath),
>>>>>> 11 + SwappedReaderFd = open_reader_fd(Filepath, Db#db.options),
>>>>>> 12 + SwappedDb = NewDb2#db{
>>>>>> 13 + fd = SwappedReaderFd,
>>>>>> 14 + updater_fd = SwappedFd
>>>>>> 15 + },
>>>>>> 16 + unlink(SwappedFd),
>>>>>> 17 close_db(Db),
>>>>>> 18 - NewDb3 = refresh_validate_doc_funs(NewDb2),
>>>>>> 19 + NewDb3 = refresh_validate_doc_funs(SwappedDb),
>>>>>> 20 ok = gen_server:call(Db#db.main_pid, {db_updated, NewDb3},
>>>>>> infinity),
>>>>>> 21 couch_db_update_notifier:notify({compacted, NewDb3#db.name}),
>>>>>> 22 ?LOG_INFO("Compaction for db \"~s\" completed.",
>>>>>> [Db#db.name]),
>>>>>
>>>>> then the gen_server:call() on line 20 never returns.
>>>>>
>>>>> Is there a major issue with this approach or just a minor mistake in my
>>>>> implementation?
>>>>>
>>>>>
>>>>> Thank you for having a look,
>>>>> Stefan
>>>>
>>>>
>>
>
> --
> Visit us at Embedded World 2023
> 14 to 16 March 2023 | Messe Nürnberg
> You will find us in Hall 4, Stand 336
>
> Dipl.-Ing. Stefan Kral, emlix GmbH, http://www.emlix.com
> Phone +49 30 275911-00, Fax -33
> Panoramastraße 1, 10178 Berlin, Germany
> Registered office: Göttingen, Amtsgericht Göttingen HR B 3160
> Management: Heike Jordan, Dr. Uwe Kracke
> VAT ID: DE 205 198 055
>
> emlix - smart embedded open source