On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:
> On Mon, Nov 8, 2010 at 12:22, Paul Davis <[email protected]> wrote:
>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <[email protected]>
>> wrote:
>>> Whoops. Hit send too early, but I think I got everything in there that
>>> I wanted to say.
>>>
>>> As for the ref counter bottleneck, I just pushed to
>>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>>> This branch uses a public ets table for the ref_counter. I think I
>>> managed to linearize the updates over the {total, RefCtr} keys in the
>>> ets table such that there should be no race conditions, but please,
>>> please take a look at this if you have time.
>>>
>>> It seems to pass the ref_counter tests, but I still need to handle
>>> giving away ownership of the ets table. Right now I use couch_server
>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>> processes, but the couch_server just crashes if it actually receives
>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>> table to another couch_ref_counter whenever the owner exits I may just
>>> break the encapsulation of the module a bit by leaving couch_server as
>>> the owner and ignoring that message.
>>>
>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>> all this is done.
>>>
>>> -Randall
>>>
>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <[email protected]> wrote:
>>>> Thanks to both of you for getting this conversation going again and
>>>> for the work on the patch and testing, Filipe.
>>>>
>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <[email protected]> wrote:
>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>
>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]>
>>>>>> wrote:
>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>
>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>
>>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>>>>>> 1' command) that when running a writes only test with relaximation
>>>>>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>>>>>
>>>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>>>> think you see larger intervals between write activity with the
>>>>>>>>> optimization from COUCHDB-767? Have you measured the time it takes
>>>>>>>>> to open the extra FD? In my tests that was a sub-millisecond
>>>>>>>>> operation, but maybe you've uncovered something else.
>>>>>>>>
>>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>>> Erlang VM scheduler's decisions about when to start/run that process.
>>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run
>>>>>>>> frequency varies a lot.
>>>>>>>
>>>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>>>> plain-old spinning disk MacBook with 200 writers in relaximation; the
>>>>>>> IOPS reported by iostat stayed very uniform.
>>>>>>>
>>>>>>>>>> For the goal of not having readers getting blocked by fsync calls
>>>>>>>>>> (and
>>>>>>>>>> write calls), I would propose using a separate couch_file process
>>>>>>>>>> just
>>>>>>>>>> for read operations. I have a branch in my github for this (with
>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>>>>>>>>> tests are very positive, both reads and writes get better response
>>>>>>>>>> times and throughput:
>>>>>>>>>>
>>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>>
>>>>>>>>> I'd like to propose an alternative optimization, which is to keep a
>>>>>>>>> dedicated file descriptor open in the couch_db_updater process and
>>>>>>>>> use that file descriptor for _all_ IO initiated by the db_updater.
>>>>>>>>> The advantage is that the db_updater does not need to do any message
>>>>>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>>>>>> message queue is large. A message queue much much larger than the
>>>>>>>>> number of concurrent writers can occur if a user writes with
>>>>>>>>> batch=ok, and it can also happen rather easily in a BigCouch cluster.
>>>>>>>>
>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>> serialized, it seems to me that your solution only avoids one level
>>>>>>> of indirection (the couch_file process). I don't see how, when using a
>>>>>>> couch_file only for writes, you get the message queue for that
>>>>>>> couch_file process full of write messages.
>>>>>>>
>>>>>>> It's the db_updater which gets a large message queue, not the
>>>>>>> couch_file. The db_updater ends up with a big backlog of update_docs
>>>>>>> messages that get in the way when it needs to make gen_server calls to
>>>>>>> the couch_file process for IO. It's a significant problem in R13B,
>>>>>>> probably less so in R14B because of some cool optimizations by the OTP
>>>>>>> team.
>>>>>>
>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>> picking the results of the calls to the couch_file process because its
>>>>>> mailbox is full of update_docs messages?
>>>>>
>>>>> Correct. Each call to the couch_file requires a selective receive on the
>>>>> part of the db_updater in order to get the response, and prior to R14
>>>>> that selective receive needed to match against every message in the
>>>>> mailbox. It's really a bigger problem in couch_server, which uses a
>>>>> gen_server call to increment a reference counter before handing the #db{}
>>>>> to the client, since every request to any DB has to talk to couch_server
>>>>> first. Best,
>>>>>
>>>>> Adam
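The selective-receive cost Adam describes can be seen in miniature below. This is a simplified, hand-rolled version of what a gen_server call does under the hood; the message shape is illustrative, not couch_file's actual protocol:

```erlang
%% Pre-R14, this receive must scan *every* queued message looking for
%% a match. With a mailbox full of update_docs messages, each IO call
%% the db_updater makes costs O(mailbox length) just to find the reply.
Ref = erlang:monitor(process, Fd),
Fd ! {io_request, self(), Ref},   %% illustrative message, not real API
receive
    {Ref, Reply} ->
        erlang:demonitor(Ref, [flush]),
        Reply;
    {'DOWN', Ref, process, Fd, Reason} ->
        exit({io_failed, Reason})
end
```

R14's optimization recognizes the freshly created reference and skips messages that predate it, which is why the problem is less severe there.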
>>>>
>>>> Adam,
>>>> I think the problem is made worse by a backed up db_updater, but the
>>>> db_updater becomes backed up because it makes more synchronous calls
>>>> to the couch_file than a reader does, handling only one update
>>>> operation at a time while readers queue up on the couch_file in
>>>> parallel.
>>>>
>>>> Filipe,
>>>> Using a separate fd for writes at the couch_file level is not the
>>>> answer. The db_updater has to read the btree before it can write,
>>>> incurring multiple trips through the couch_file message queue between
>>>> queuing append_term requests and processing its message queue for new
>>>> updates. Using two file descriptors keeps the readers out of the way
>>>> of the writers only if you select which fd to use at the db-operation
>>>> level and not the file-operation level. Perhaps two couch_file
>>>> processes would be better. Fairness should be left to the operating
>>>> system's I/O scheduler once reads no longer queue behind writes. This
>>>> seems like the best way forward to me right now. Let's try to crunch
>>>> some numbers on it soon.
>>>>
>>>> I couldn't find a solution I liked that was fair to readers and
>>>> writers at any workload with only one file descriptor. The btree cache
>>>> alleviates this problem a bit because the read path becomes much
>>>> faster and therefore improves database reads and writes.
>>>>
>>>> As to the patch, I'd think we need the readers and writers separated
>>>> into two separate couch_files. That way the updater can perform its
>>>> reads on the "writer" fd, otherwise writers suffer starvation because
>>>> readers go directly into the couch_file queue in parallel instead of
>>>> serializing through something like db_updater.
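Selecting the fd at the db-operation level, as Randall proposes, might be sketched like this. The `reader_fd`/`updater_fd` record fields and function names are hypothetical; `couch_file:pread_term/2` and `couch_file:append_term/2` are the existing couch_file calls:

```erlang
%% All updater-initiated IO (including its btree reads) goes to the
%% "writer" couch_file; only external readers use the "reader"
%% couch_file. The updater never waits behind a queue of reader
%% requests, and readers never wait behind the updater's IO.
read_doc(#db{reader_fd = Fd}, Pos) ->
    couch_file:pread_term(Fd, Pos).        %% client read path

updater_read(#db{updater_fd = Fd}, Pos) ->
    couch_file:pread_term(Fd, Pos).        %% db_updater btree reads

append_doc(#db{updater_fd = Fd}, Term) ->
    couch_file:append_term(Fd, Term).      %% db_updater writes
```

The key point is the split by *who* is doing the IO, not by whether the syscall is a read or a write.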
>>>>
>>>
>>
>> Wasn't there a branch or patch somewhere that just removed the
>> ref_counter code entirely and used monitors/links to make sure
>> everything behaved correctly? I'm not sure I ever saw it to see how
>> dramatic and/or scary it was, but it might be another approach to
>> consider.
>>
>
> Adam should chime in. I think BigCouch got rid of the ref counter in
> favor of something else, but last I asked him about it he said there
> might be a small edge case race condition. How critical that is I
> can't say. It may be that that edge case is tolerable and recoverable.
BigCouch uses a protected couch_dbs ets table to store all open #dbs and a
public couch_lru ets table for LRU updates. Reference counting is accomplished
by having clients monitor and demonitor the couch_file. The least recently
used database is determined by folding over the couch_lru table. The race
condition Randall is referring to is the following:
1) client process grabs a #db{} record from the couch_dbs ets table
2) couch_server calculates that the LRU DB is the one the client just looked up
3) couch_server checks the monitored_by field for the #db.fd process and finds
no clients are monitoring it
4) couch_server kills the DB
5) client updates the LRU time (too late)
6) client monitors the #db.fd, but it's already dead
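In code, the racy check on the couch_server side would look something like this. The function is illustrative only; `erlang:process_info(Fd, monitored_by)` and `couch_file:close/1` are real calls:

```erlang
%% Steps 3 and 4 from above: couch_server sees no monitors on the fd
%% and closes it. A client that already holds the #db{} record can
%% lose the race by calling erlang:monitor(process, Fd) just after
%% this check runs but before it uses the fd.
maybe_close_lru_db(Fd) ->
    case erlang:process_info(Fd, monitored_by) of
        {monitored_by, []} ->
            couch_file:close(Fd),   %% client's later monitor finds it dead
            closed;
        {monitored_by, [_ | _]} ->
            still_in_use
    end.
```

Because the client's lookup (step 1) and its monitor (step 6) are not atomic with respect to this check, no amount of care inside couch_server alone can close the window.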
Practically speaking, it's been a non-issue, but it does exist. For what it's
worth, a similar race condition existed in older (0.10.x and below) versions of
CouchDB, where the client would increment the ref counter after receiving the
DB from couch_server. The current implementation in CouchDB avoids the race
condition by having couch_server increment the ref counter on behalf of the
client, but that approach has serious performance and stability implications
for BigCouch.
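The fix in current CouchDB amounts to doing the increment inside couch_server before the reply is sent, roughly as below. `lookup_open_db/2` is a hypothetical helper; `couch_ref_counter:add/2` is the real API:

```erlang
%% couch_server bumps the counter on the client's behalf before handing
%% back the #db{}, so there is no window in which the count can read
%% zero while a client still intends to use the db. The cost is that
%% every open goes through this serialized gen_server call.
handle_call({open, DbName, _Options}, {ClientPid, _Tag}, Server) ->
    {ok, Db} = lookup_open_db(DbName, Server),   %% hypothetical helper
    ok = couch_ref_counter:add(Db#db.fd_ref_counter, ClientPid),
    {reply, {ok, Db}, Server}.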
I'll try to take a look at Randall's github branch tonight. Using monitors for
reference counting feels really right to me, but as it's not possible to
monitor a process on behalf of someone else, I'm not sure it's possible to
avoid the race that I described using monitors. Regards,
Adam