Hi, The search code works pretty much like you described. I was surprised to find out that in v2.2 it still didn't work correctly, but it was quite a small fix: http://hg.dovecot.org/dovecot-2.2/rev/4b0a736bc40c
On 11.10.2013, at 9.07, Rick van Rein (OpenFortress) <r...@openfortress.nl> wrote: > Hello, > > I love Dovecot, but when developing a small IMAP tool, I ran into searching > behaviour can easily be optimised. Please forgive a rather detailed > suggestion. This was on Dovecot 1.2.15 on Debian Squeeze. > > My tool? It's called "midget" and retrieves documents from an IMAP box based > on their mid: or cid: identifier, as per RFC 2392. I thought this would be > useful to retrieve email attachments into a remote shell environment. When > using Kerberos, the credentials and a strong hint of one's mail address are > already present anyway, DNS SRV does the rest. If you want to see the early > code, I'll be happy to share it here. > > The three formats of the URIs in RFC 2392 are: > * mid:messageid > * cid:contentid > * mid:messageid/contentid > > Searching for a mid: with only a Message-ID part is quick and probably > indexed, > > (HEADER Message-ID <messageid>) > > Searching for a cid: meant ploughing through body text to get to the > attachments, so it was a slow and costly and needed checking of the outcome: > > (TEXT <contentid>) > > The quicker URI form to get to a Content-Id should be to mention it at the > end of a mid: URI, and search for > > (HEADER Message-ID <messageid>)(TEXT <conentid>) > > Much to my surprise, this took about as long as the cid: query -- ploughing > through lots of body text while it only needed to look through a one message > bodies! As a human, I can see a straightforward improvement. However, it is > more difficult to find a general solution -- but that is what I'm describing > below, in the hope that it will help to improve Dovecot's search performance. > > > 0. In general, searches are Boolean expressions composed with implied AND and > explicit (NOT …) and (OR ….) constructs. > > 1. Using the theorems of The Morgan, push all (NOT …) constructs inside as > far as possible, until they surround elementary statements. Some may have a > trivial solution, such as (NOT (ALL)). Others will retain the NOT but that > might be handled with a different use of the indexes. In general, this phase > turns all AND and OR constructs into positive logic, without duplicating > search terms. > > 2. Estimate the effort involved in every part of the calculation. (HEADER) > is lighter than (TEXT) and headers may be further split into with / without > index. Larger sets could be more work to search than smaller ones. In > general, an estimate of the number of comparisons could be a good metric for > work to be done. > > 3. For AND compositions, start with the simplest one in the AND compostition, > and continue with increasingly heavier ones. Apply later conditions only on > the output of the previous conditions, either after collecting all of it or > everytime something is found. Whatever is optimal -- bulk operations are > more efficient, but may collect large sets to handle. > > 4. For OR compositions, either start a thread for each part, or switch focus > and collect sorted streams into one collectively sorted stream. > > I've been doing this stuff all through my PhD thesis work, so it come easy to > me :-) But this optimisation does not appear to be implemented yet, so I > thought I'd suggest it. I hope this is useful to Dovecot! > > > Best wishes, > > Rick van Rein > >