I’m using Dovecot FTS with the flatcurve backend in a mailcow: dockerized setup.
When searching for an email address with a hyphenated local-part (e.g., 
m...@example.com), the email-address tokenizer splits the local-part on 
hyphens, producing tokens like ma, g, and m...@example.com. This prevents 
searching for ma-g as a single term. With fts_flatcurve_substring_search = yes, 
searching for ma-g matches unrelated addresses containing ma (e.g., 
mana...@example.com), leading to irrelevant results.

Dovecot version: 2.3.21.1 (d492236fa0)

Relevant part of the Dovecot configuration:

plugin {
    fts = flatcurve
    fts_autoindex = yes
    fts_autoindex_exclude = \Junk
    fts_autoindex_exclude2 = \Trash
    fts_autoindex_max_recent_msgs = 999999
    fts_tokenizers = generic email-address
    fts_tokenizer_email_address = maxlen=100
    fts_tokenizer_generic = algorithm=simple maxlen=100
    fts_flatcurve_substring_search = yes
    fts_languages = en es de ru
    fts_filters = normalizer-icu snowball stopwords
    fts_filters_en = lowercase snowball english-possessive stopwords
    fts_filters_ru = lowercase snowball stopwords
    fts_index_timeout = 300s
}
service indexer-worker {
    process_limit = 12
    vsz_limit = 512 MB
}



Steps to Reproduce:
Index an email with m...@example.com in the From field, and also index an 
email whose From address contains "ma" and "g".

Check tokenization:

doveadm fts tokenize -u u...@example.com "m...@example.com"

Output:

ma
g
example
com
m...@example.com

Search:

doveadm search -u u...@example.com FROM ma-g

Results include mana...@example.com due to ma matching.

Expected Behavior:
FROM ma-g should match only emails with m...@example.com, treating ma-g as a 
single term or exact local-part.
Expected tokens:
doveadm fts tokenize -u u...@example.com "m...@example.com"

Output:
ma-g
ma
g
example
com
m...@example.com

Actual Behavior:
The tokenizer splits ma-g into ma and g. Substring search matches "ma" or "g" 
in unrelated addresses (e.g., mana...@example.com, g...@example.com). Without 
substring search, ma-g matches nothing unless searching the full address.
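
To make the false positive concrete, here is a toy model of the behavior described above. It is only a sketch, not Dovecot or flatcurve code: the function names and the addresses (ab-cd@example.org, abacus@example.org) are made up for illustration, and it assumes that a query token matching inside any indexed token is enough for a hit, which is how the results above look but is not confirmed against the flatcurve source.

```python
import re

def split_tokens(text):
    """Split on non-alphanumeric characters, like the tokenizer output above suggests."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def substring_match(query, indexed_address):
    """Toy substring search: any query token found inside any indexed token (assumption)."""
    indexed = split_tokens(indexed_address)
    return any(q in tok for q in split_tokens(query) for tok in indexed)

def exact_local_part_match(query, indexed_address):
    """What the report expects instead: compare against the whole local-part."""
    return indexed_address.split("@", 1)[0].lower() == query.lower()

# Hypothetical addresses, chosen only for illustration.
print(substring_match("ab-cd", "abacus@example.org"))         # True (false positive via "ab")
print(exact_local_part_match("ab-cd", "abacus@example.org"))  # False
print(exact_local_part_match("ab-cd", "ab-cd@example.org"))   # True
```

Once the hyphenated query is reduced to short fragments, any address containing one of those fragments becomes a candidate.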

Impact:
Searching for short hyphenated local-parts is unreliable, especially when 
they contain common fragments like ma, which flood the results with 
irrelevant matches.

Request:
Add a configuration option, such as 
"fts_tokenizer_email_address_keep_hyphenated = yes|no" (default: no, for 
compatibility), to include the hyphenated local-part of an email address as an 
additional token. For example, with "yes", tokenizing "m...@example.com" would 
produce "ma-g", "ma", "g", "example", "com", and "m...@example.com". This 
allows searches for "FROM ma-g" to match emails with "m...@example.com" 
exactly, while preserving "ma" and "g" for substring searches. Consider "yes" 
as a future default, as including hyphenated local-parts aligns with RFC 5322 
and user expectations for precise email searches, especially for common 
hyphenated addresses like "first-l...@domain.com". If changing defaults, 
provide upgrade notes for users relying on the current token set.
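
As a rough illustration of the requested token set, the following sketch emulates the tokenizer output with and without the proposed option. It is not Dovecot code: tokenize_address and its keep_hyphenated parameter are hypothetical stand-ins for the suggested setting, the address is made up, and the splitting rule (break on any non-alphanumeric character) is an assumption based on the doveadm output shown earlier.

```python
import re

def tokenize_address(address, keep_hyphenated=False):
    """Toy model of the token set discussed in this post (not Dovecot's tokenizer)."""
    local, _, domain = address.partition("@")
    tokens = []
    # Proposed behavior: emit the unsplit hyphenated local-part as an extra token.
    if keep_hyphenated and "-" in local:
        tokens.append(local)
    # Current behavior as observed: local-part and domain split into fragments.
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", local) if t]
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", domain) if t]
    # The full address is kept as a single token in both cases.
    tokens.append(address)
    return tokens

# Hypothetical address, chosen only for illustration.
print(tokenize_address("ab-cd@example.org"))
# ['ab', 'cd', 'example', 'org', 'ab-cd@example.org']
print(tokenize_address("ab-cd@example.org", keep_hyphenated=True))
# ['ab-cd', 'ab', 'cd', 'example', 'org', 'ab-cd@example.org']
```

With the extra token in the index, an exact-term search for the local-part has something to match, while the existing fragments stay available for substring searches.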

Is there any workaround to search hyphenated local-parts accurately?

Best regards,
Daniel Levin



