Hello,

I love Dovecot, but when developing a small IMAP tool, I ran into searching 
behaviour can easily be optimised.  Please forgive a rather detailed 
suggestion.  This was on Dovecot 1.2.15 on Debian Squeeze.

My tool?  It's called "midget" and retrieves documents from an IMAP box based 
on their mid: or cid: identifier, as per RFC 2392.  I thought this would be 
useful to retrieve email attachments into a remote shell environment.  When 
using Kerberos, the credentials and a strong hint of one's mail address are 
already present anyway, DNS SRV does the rest.  If you want to see the early 
code, I'll be happy to share it here.

The three formats of the URIs in RFC 2392 are:
 * mid:messageid
 * cid:contentid
 * mid:messageid/contentid

Searching for a mid: with only  a Message-ID part is quick and probably indexed,

(HEADER Message-ID <messageid>)

Searching for a cid: meant ploughing through body text to get to the 
attachments, so it was a slow and costly and needed checking of the outcome:

(TEXT <contentid>)

The quicker URI form to get to a Content-Id should be to mention it at the end 
of a mid: URI, and search for

(HEADER Message-ID <messageid>)(TEXT <conentid>)

Much to my surprise, this took about as long as the cid: query -- ploughing 
through lots of body text while it only needed to look through a one message 
bodies!  As a human, I can see a straightforward improvement.  However, it is 
more difficult to find a general solution -- but that is what I'm describing 
below, in the hope that it will help to improve Dovecot's search performance.


0. In general, searches are Boolean expressions composed with implied AND and 
explicit (NOT …) and (OR ….) constructs.

1. Using the theorems of The Morgan, push all (NOT …) constructs inside as far 
as possible, until they surround elementary statements.  Some may have a 
trivial solution, such as (NOT (ALL)).  Others will retain the NOT but that 
might be handled with a different use of the indexes.  In general, this phase 
turns all AND and OR constructs into positive logic, without duplicating search 
terms.

2. Estimate the effort involved in every part of the calculation.  (HEADER) is 
lighter than (TEXT) and headers may be further split into with / without index. 
 Larger sets could be more work to search than smaller ones.  In general, an 
estimate of the number of comparisons could be a good metric for work to be 
done.

3. For AND compositions, start with the simplest one in the AND compostition, 
and continue with increasingly heavier ones.  Apply later conditions only on 
the output of the previous conditions, either after collecting all of it or 
everytime something is found.  Whatever is optimal -- bulk operations are more 
efficient, but may collect large sets to handle.

4. For OR compositions, either start a thread for each part, or switch focus 
and collect sorted streams into one collectively sorted stream.

I've been doing this stuff all through my PhD thesis work, so it come easy to 
me :-)  But this optimisation does not appear to be implemented yet, so I 
thought I'd suggest it.  I hope this is useful to Dovecot!


Best wishes,

Rick van Rein


Reply via email to