Re: Truncated text during Xapian indexing

Robert Stepanek Thu, 15 Feb 2018 07:14:01 -0800

On Thu, Feb 15, 2018, at 13:08, Sebastian Hagedorn wrote:
> Is the setting "search_skipdiacrit" in imapd.conf honored during the 
> indexing or is that only relevant while searching? Given your comment 
> regarding search normalization above I take it Umlaut characters are not 
> considered diacriticals? It's not a huge issue, but as a German university 
> it would be nice for our users if a search could distinguish between 
> "hatte" and "hätte", as an example.


Cyrus does treat umlaut characters as diacritics (I was just handwaving that 
away in my previous comment because of the default settings). The 
search_skipdiacrit setting applies to both indexing and search.

As an example, let's append two emails to a mailbox. The body of message 1 
contains the German verb "gären". Message 2 contains the verb "garen" (for the 
non-German speakers: these verbs mean two different things).

With search_skipdiacrit set to true (the default), this is what lands in the 
Xapian database:

   [...] Zgaren garen

and a search for either "garen" or "gären" matches both messages.

With search_skipdiacrit set to false, however, we get

  [...] Zgaren Zgären garen gären

and a search for "garen" or "gären" matches only the message that contains 
that exact spelling.
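
To make the difference concrete, here is a minimal Python sketch of the 
behaviour described above. It is not Cyrus code: the function names are made 
up for illustration, and stripping diacritics via Unicode NFD decomposition is 
an assumption about how such folding can be done, not a statement about the 
Cyrus implementation. It also ignores stemming (the Z terms).

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks, so "ä" becomes "a"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(bodies, skip_diacritics):
    """Toy index: map each term to the set of message numbers containing it."""
    index = {}
    for msg_no, body in bodies.items():
        for word in body.lower().split():
            term = strip_diacritics(word) if skip_diacritics else word
            index.setdefault(term, set()).add(msg_no)
    return index


def search(index, query, skip_diacritics):
    """Normalize the query the same way the index was built, then look it up."""
    term = query.lower()
    if skip_diacritics:
        term = strip_diacritics(term)
    return index.get(term, set())


# Message 1 contains "gären", message 2 contains "garen".
bodies = {1: "gären", 2: "garen"}

folded = index_terms(bodies, skip_diacritics=True)
exact = index_terms(bodies, skip_diacritics=False)

print(search(folded, "gären", True))   # both messages
print(search(exact, "gären", False))   # only message 1
```

The point of the sketch is that the same normalization has to be applied at 
index time and at query time, which is why changing the setting matters for 
both.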

I uploaded a new test to Cassandane that demonstrates this [1] (the 
subject_isutf8 test case might also be of interest). I'd only deactivate 
search_skipdiacrit if you are sure that your users will benefit from it. If in 
doubt, I would rather err on the safe side and return false positives by 
skipping diacritics (the default).

There's more to say about the Z prefixes, which mark stemmed terms: Cyrus 
currently uses the English stemmer for all text, so for non-English text the 
stem terms typically end up identical to the unstemmed input. While this might 
seem odd, it's the best we can do without proper language detection for both 
indexing and search. I implemented multi-language stem support in an 
experimental feature branch, but haven't yet resolved the issues around 
fingerprinting search queries. There's an open issue to track this [2].

[1] https://github.com/cyrusimap/cassandane/blob/master/Cassandane/Cyrus/SearchFuzzy.pm#L403
[2] https://github.com/cyrusimap/cyrus-imapd/issues/72

> Just out of curiosity, how is the mapping between a Xapian docid and a 
> message file on disk achieved? I played around with xapian-delve and the 
> Perl example simplesearch.pl. When I search a term, I get a list of 
> docid's, but how do I know which message that is?

In 3.x, Cyrus search stores an internal unique message id, called guid, as 
docid in Xapian. The guid currently is a SHA-1 hash of the raw message, 
allowing for deduplication and to avoid re-indexing already seen messages. The 
conversations.db of a user maps this guid to a list of mailbox:UID pairs.
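
As a rough illustration of why a content hash helps here, the following Python 
sketch computes a SHA-1 digest over the raw message bytes and uses it to skip 
messages that were already indexed. The function names are hypothetical, and 
the assumption that the hash is taken over exactly these bytes and rendered as 
lowercase hex is mine; check the Cyrus source before relying on it.

```python
import hashlib


def message_guid(raw_message: bytes) -> str:
    """SHA-1 of the raw message bytes, as a hex string (assumed format)."""
    return hashlib.sha1(raw_message).hexdigest()


def messages_to_index(raw_messages, seen_guids):
    """Return guids of messages not indexed yet; remember them in seen_guids."""
    new_guids = []
    for raw in raw_messages:
        guid = message_guid(raw)
        if guid in seen_guids:
            continue  # identical content was indexed before, skip it
        seen_guids.add(guid)
        new_guids.append(guid)
    return new_guids


raw = b"Subject: test\r\n\r\ngaren\r\n"
seen = set()
print(messages_to_index([raw, raw], seen))  # one guid, the copy is skipped
print(messages_to_index([raw], seen))       # empty, already seen
```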

Off the top of my head, there currently isn't an "official" way in Cyrus to 
retrieve the mailbox:UID list for a given guid from outside the Cyrus process. 
Depending on your use case, you could:

1.) build your own mapper on top of imap/conversations.h,
2.) use cvt_cyrusdb to dump the contents of a conversations.db into plain 
    text, or
3.) use the JMAP layer to fetch the JMAP-formatted message or the raw message 
    blob by id.

For JMAP email, use the guid and prefix it with 'M' in an Email/get method. 
For blobs, use 'G' as prefix. Both are "unofficial": we might change the JMAP 
id scheme in future releases. But I guess this isn't going to happen any time 
soon, if ever.
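
For the JMAP route, a request body could be built roughly like this in Python. 
Treat it as a sketch under assumptions: the account id is a placeholder, the 
capability URIs are the ones from the published JMAP specifications and may 
not match what a given Cyrus build expects, and I am not stating here whether 
the id uses the full guid or a shortened form. Only the 'M' and 'G' prefixes 
come from the description above.

```python
import json


def email_id(guid: str) -> str:
    """JMAP email id: the guid prefixed with 'M' (unofficial scheme)."""
    return "M" + guid


def blob_id(guid: str) -> str:
    """JMAP blob id: the guid prefixed with 'G' (unofficial scheme)."""
    return "G" + guid


def email_get_request(guid: str, account_id: str) -> dict:
    """Build a JMAP request object with a single Email/get method call."""
    return {
        # Capability URIs are an assumption taken from the JMAP RFCs.
        "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
        "methodCalls": [
            ["Email/get", {"accountId": account_id, "ids": [email_id(guid)]}, "c1"]
        ],
    }


# Placeholder values, purely for illustration.
guid = "0123456789abcdef0123456789abcdef01234567"
request = email_get_request(guid, account_id="user@example.com")
print(json.dumps(request, indent=2))
```

The resulting JSON would be POSTed to the server's JMAP endpoint with the 
user's credentials; the endpoint URL depends on your httpd configuration.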

Hope it helps,
Robert
----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
