I’m using Dovecot FTS with the flatcurve backend in a mailcow: dockerized setup. When searching for an email address with a hyphenated local-part (e.g., m...@example.com), the email-address tokenizer splits the local-part on hyphens, producing tokens like ma, g, and m...@example.com. This prevents searching for ma-g as a single term. With fts_flatcurve_substring_search = yes, searching for ma-g matches unrelated addresses containing ma (e.g., mana...@example.com), leading to irrelevant results.

Dovecot version: 2.3.21.1 (d492236fa0)

Including only the relevant part of the dovecot config:

    plugin {
      fts = flatcurve
      fts_autoindex = yes
      fts_autoindex_exclude = \Junk
      fts_autoindex_exclude2 = \Trash
      fts_autoindex_max_recent_msgs = 999999
      fts_tokenizers = generic email-address
      fts_tokenizer_email_address = maxlen=100
      fts_tokenizer_generic = algorithm=simple maxlen=100
      fts_flatcurve_substring_search = yes
      fts_languages = en es de ru
      fts_filters = normalizer-icu snowball stopwords
      fts_filters_en = lowercase snowball english-possessive stopwords
      fts_filters_ru = lowercase snowball stopwords
      fts_index_timeout = 300s
    }

    service indexer-worker {
      process_limit = 12
      vsz_limit = 512 MB
    }

Steps to Reproduce:

1. Index an email with m...@example.com in the From field. Also index an email whose From address contains "ma" and "g".

2. Check tokenization:

    doveadm fts tokenize -u u...@example.com "m...@example.com"

   Output: ma g example com m...@example.com

3. Search:

    doveadm search -u u...@example.com FROM ma-g

   Results include mana...@example.com due to ma matching.

Expected Behavior:

FROM ma-g should match only emails with m...@example.com, treating ma-g as a single term or exact local-part.

Expected tokens:

    doveadm fts tokenize -u u...@example.com "m...@example.com"

Output: ma-g ma g example com m...@example.com

Actual Behavior:

The tokenizer splits ma-g into ma and g. Substring search matches "ma" or "g" in unrelated addresses (e.g., mana...@example.com, g...@example.com). Without substring search, ma-g matches nothing unless searching for the full address.

Impact:

Searching for short hyphenated local-parts is unreliable, especially when a fragment such as ma is common: the results are flooded with irrelevant matches.

Request:

Add a configuration option, such as "fts_tokenizer_email_address_keep_hyphenated = yes|no" (default: no, for compatibility), to include the hyphenated local-part of an email address as an additional token. For example, with "yes", tokenizing "m...@example.com" would produce "ma-g", "ma", "g", "example", "com", and "m...@example.com". This would allow a search for "FROM ma-g" to match emails with "m...@example.com" exactly, while preserving "ma" and "g" for substring searches.

Consider "yes" as a future default, as including hyphenated local-parts aligns with RFC 5322 and with user expectations for precise email searches, especially for common hyphenated addresses like "first-l...@domain.com". If the default changes, please provide upgrade notes for users relying on the current token set.

Is there any workaround to search hyphenated local-parts accurately?
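
To make the request concrete, here is a minimal Python sketch of the two token sets I have in mind. It is not Dovecot code and does not reflect how the email-address tokenizer is implemented: tokenize_address and its keep_hyphenated flag are hypothetical names that only mimic the proposed option, and ab-c@example.org is a made-up address used for illustration.

```python
import re


def tokenize_address(address, keep_hyphenated=False):
    """Approximate the token sets described in this report.

    Hypothetical helper, not Dovecot's implementation: it only mirrors
    the observed output (split pieces plus the full address) and the
    proposed extra token for a hyphenated local-part.
    """
    local, _, domain = address.partition("@")
    tokens = []
    # Proposed behavior: keep the hyphenated local-part as its own token.
    if keep_hyphenated and "-" in local:
        tokens.append(local)
    # Observed behavior: local-part and domain are split on non-alphanumerics.
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", local) if t]
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", domain) if t]
    # The full address is also emitted as a single token.
    tokens.append(address)
    return tokens


current = tokenize_address("ab-c@example.org")
proposed = tokenize_address("ab-c@example.org", keep_hyphenated=True)

print(current)
print(proposed)
print("ab-c" in current, "ab-c" in proposed)
```

With the extra token present, an exact term lookup for the hyphenated local-part can succeed without falling back to the short fragments, which is the whole point of the option.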

Best regards,
Daniel Levin