Hi,

The search code works pretty much like you described. I was surprised to find 
out that in v2.2 it still didn't work correctly, but it was quite a small fix: 
http://hg.dovecot.org/dovecot-2.2/rev/4b0a736bc40c

On 11.10.2013, at 9.07, Rick van Rein (OpenFortress) <r...@openfortress.nl> 
wrote:

> Hello,
> 
> I love Dovecot, but when developing a small IMAP tool, I ran into searching 
> behaviour can easily be optimised.  Please forgive a rather detailed 
> suggestion.  This was on Dovecot 1.2.15 on Debian Squeeze.
> 
> My tool?  It's called "midget" and retrieves documents from an IMAP box based 
> on their mid: or cid: identifier, as per RFC 2392.  I thought this would be 
> useful to retrieve email attachments into a remote shell environment.  When 
> using Kerberos, the credentials and a strong hint of one's mail address are 
> already present anyway, DNS SRV does the rest.  If you want to see the early 
> code, I'll be happy to share it here.
> 
> The three formats of the URIs in RFC 2392 are:
> * mid:messageid
> * cid:contentid
> * mid:messageid/contentid
> 
> Searching for a mid: with only  a Message-ID part is quick and probably 
> indexed,
> 
> (HEADER Message-ID <messageid>)
> 
> Searching for a cid: meant ploughing through body text to get to the 
> attachments, so it was a slow and costly and needed checking of the outcome:
> 
> (TEXT <contentid>)
> 
> The quicker URI form to get to a Content-Id should be to mention it at the 
> end of a mid: URI, and search for
> 
> (HEADER Message-ID <messageid>)(TEXT <conentid>)
> 
> Much to my surprise, this took about as long as the cid: query -- ploughing 
> through lots of body text while it only needed to look through a one message 
> bodies!  As a human, I can see a straightforward improvement.  However, it is 
> more difficult to find a general solution -- but that is what I'm describing 
> below, in the hope that it will help to improve Dovecot's search performance.
> 
> 
> 0. In general, searches are Boolean expressions composed with implied AND and 
> explicit (NOT …) and (OR ….) constructs.
> 
> 1. Using the theorems of The Morgan, push all (NOT …) constructs inside as 
> far as possible, until they surround elementary statements.  Some may have a 
> trivial solution, such as (NOT (ALL)).  Others will retain the NOT but that 
> might be handled with a different use of the indexes.  In general, this phase 
> turns all AND and OR constructs into positive logic, without duplicating 
> search terms.
> 
> 2. Estimate the effort involved in every part of the calculation.  (HEADER) 
> is lighter than (TEXT) and headers may be further split into with / without 
> index.  Larger sets could be more work to search than smaller ones.  In 
> general, an estimate of the number of comparisons could be a good metric for 
> work to be done.
> 
> 3. For AND compositions, start with the simplest one in the AND compostition, 
> and continue with increasingly heavier ones.  Apply later conditions only on 
> the output of the previous conditions, either after collecting all of it or 
> everytime something is found.  Whatever is optimal -- bulk operations are 
> more efficient, but may collect large sets to handle.
> 
> 4. For OR compositions, either start a thread for each part, or switch focus 
> and collect sorted streams into one collectively sorted stream.
> 
> I've been doing this stuff all through my PhD thesis work, so it come easy to 
> me :-)  But this optimisation does not appear to be implemented yet, so I 
> thought I'd suggest it.  I hope this is useful to Dovecot!
> 
> 
> Best wishes,
> 
> Rick van Rein
> 
> 

Reply via email to