Dovecot - FTS Solr: disk usage & position information?

Vincent Brillault Wed, 04 Aug 2021 00:25:14 -0700

Dear all,

On a local dovecot cluster currently hosting roughly 2.1TB of data,
using Solr as its FTS backend, we now have 256GB of data in Solr, split
across 12 shards (replication adds another 256GB through 12
additional cores).

I'm now trying to see whether we can reduce that footprint. Looking at
one core at random (22G), I see that most of the data is in:
- .pos files: 12G
- .tim files: 4.2G
- .doc files: 3.8G
- .cfs files: 1.8G

Looking around a bit, I found
https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat.html
(which is unfortunately a bit outdated, I think), which explains the
content of each file:
- .tim: Term Dictionary
- .tip: Term Index
- .doc: Frequencies and Skip Data
- .pos: Positions
- .pay: Payloads and Offsets

So clearly the file naming convention has changed, but if .pos
really is position information ("lists of positions that each term occurs
at within documents."), it sounds rather useless for the dovecot
integration.

Looking at the Solr documentation on search
(https://solr.apache.org/guide/8_6/the-standard-query-parser.html), it
seems that position-aware queries are written as `"term1 term2"~[0-9]+`.
Looking at the dovecot code
(https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c),
I don't see this kind of query being made; `~` is only used for fuzzy
search.

Has anyone ever tried setting omitTermFreqAndPositions or omitPositions
to true for the text fields in the Solr schema? It sounds like this
could considerably reduce the disk space used by Solr without losing any
feature. The only thing I'm not too clear about is
"autoGeneratePhraseQueries", which is enabled in
https://github.com/dovecot/core/blob/master/doc/solr-schema-7.7.0.xml.
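
To make the question concrete, here is an untested sketch of what the
change might look like for the text_basic type from the attached schema.
It rests on assumptions: that omitPositions="true" is enough to stop
position data being written, that autoGeneratePhraseQueries has to be
switched off at the same time (a phrase query cannot run against a field
indexed without positions), and that dovecot never sends a quoted phrase
to these fields. A full reindex would presumably be needed before any
space is actually reclaimed.

```xml
<!-- Sketch only, not a tested configuration.
     omitPositions="true": keep term frequencies, drop position data.
     autoGeneratePhraseQueries="false": assumption, so that tokens split by
     WordDelimiterGraphFilterFactory are not turned into phrase queries,
     which need the positions omitted here. -->
<fieldType name="text_basic" class="solr.TextField" omitPositions="true" autoGeneratePhraseQueries="false" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The same two attributes would apply to the "text" type. With
autoGeneratePhraseQueries off, a term such as "wi-fi" would match
documents containing its parts anywhere rather than next to each other,
which is the feature loss I would like to confirm is acceptable.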

Thanks in advance,
Vincent Brillault

PS: I have attached the schema we are using for completeness. It's based
on the one in the dovecot repo, with a bit of simplification for headers
that don't really require as much massaging.

<?xml version="1.0" encoding="UTF-8"?>

<schema name="dovecot" version="2.0">
  <fieldType name="string" class="solr.StrField" omitNorms="true" sortMissingLast="true"/>
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0"/>

  <fieldType name="text_basic" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="uid" type="long" indexed="true" required="true" stored="true"/>
  <field name="box" type="string" indexed="true" required="true" stored="true"/>
  <field name="user" type="string" indexed="true" required="true" stored="true"/>

  <field name="hdr" type="text_basic" indexed="true" stored="false"/>
  <field name="body" type="text" indexed="true" stored="false"/>

  <field name="from" type="text_basic" indexed="true" stored="false"/>
  <field name="to" type="text_basic" indexed="true" stored="false"/>
  <field name="cc" type="text_basic" indexed="true" stored="false"/>
  <field name="bcc" type="text_basic" indexed="true" stored="false"/>
  <field name="subject" type="text" indexed="true" stored="false"/>

  <!-- Used by Solr internally: -->
  <field name="_version_" type="long" indexed="true" stored="true"/>

  <uniqueKey>id</uniqueKey>
</schema>
