Quoting Eric Abrahamsen <e...@ericabrahamsen.net>:

Michael M Slusarz <slus...@curecanti.org> writes:

Quoting Eric Abrahamsen <e...@ericabrahamsen.net>:

While I've got you here, I hope you'll answer one more question: what's
the format for searching multiple terms with non-ascii strings? Is it
possible in one run to find a utf-8 encoded subject, and a utf-8 encoded
body?

IMAP interaction would look like this:

C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: + OK
C: aéb BODY {4}
S: + OK
C: aéb
S: * SEARCH XXX
S: . OK

Even better: if the server supports LITERAL+, you can use non-synchronizing
literals ({4+} instead of {4}). The client then doesn't have to wait for the
server's continuation response before sending each literal, which saves 2
round-trips to the server:

C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
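
For anyone scripting this, here is a minimal Python sketch of how the client lines above could be assembled. The function name search_chunks and its parameters are made up for illustration and are not part of any IMAP library. The point it shows is that the number in braces is the length of the UTF-8 encoded bytes, not the number of characters ("aéb" is 3 characters but 4 bytes).

```python
def search_chunks(tag, criteria, literal_plus=False):
    """Build the client side of a UID SEARCH that sends each value as a literal.

    Hypothetical helper: returns a list of byte chunks. Each chunk except the
    last ends right after a literal header such as {4} or {4+}.
    """
    marker = "+" if literal_plus else ""  # {n+} is the LITERAL+ form
    chunks = []
    current = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, value in criteria:
        data = value.encode("utf-8")
        # The literal size counts encoded bytes, not characters.
        current += f" {key} {{{len(data)}{marker}}}\r\n".encode("ascii")
        chunks.append(current)
        current = data  # the literal data starts the next chunk
    chunks.append(current + b"\r\n")  # final CRLF ends the command
    return chunks


if __name__ == "__main__":
    for chunk in search_chunks(".", [("SUBJECT", "aéb"), ("BODY", "aéb")]):
        print(chunk)
```

With literal_plus=False the client has to wait for the server's "+" continuation after every chunk except the last one. With literal_plus=True (and only if the server advertises LITERAL+) the chunks can simply be joined and written in one go.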

michael

One other question:

I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ascii strings, am I still benefiting from the speedups
of the search index?

I know that some people who are indexing non-ascii, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?

I have no insight on Lucene internals.

michael
