: We are using Solr 7.1.0 to index a database of addresses. We have found
: that our index size increases massively when we add one extra field to
: the index, even though that field is stored and not indexed, and doesn’t
what about docValues?
: When we run an index load without the problematic field present, the
: Solr index size is 5.5GB. When we add the field into the index, the
: size grows to 13.3GB. The field itself is a maximum of 46 characters in
: length and on average is 19 characters. We have ~14,000,000 rows in
: total to index of which only ~200,000 have this field present at all
: (i.e. not null in database). Given that we don’t want to index the
: field, only store it I would have thought (perhaps naively) that the
: storage increase would be approximately 200,000 * 19 = 3.8M bytes =
: 3.6MB rather than the 7.5GB we are seeing.
If the field has docValues enabled, then there will be some overhead for
every doc in the index -- even the ones that don't have a value in this
field. (Although I'd still be very surprised if it accounted for 7G.)
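As a rough sanity check on the figures quoted above, here is a small sketch of the arithmetic. It uses only the numbers from the question (5.5GB and 13.3GB index sizes, ~14,000,000 docs, ~200,000 values averaging 19 characters); the variable names are just for illustration, and the sizes are treated as decimal units.

```shell
# Expected raw size of the stored values: ~200,000 values * ~19 bytes each.
expected_bytes=$((200000 * 19))

# Observed growth in MB (13.3GB - 5.5GB, using decimal units).
extra_mb=$((13300 - 5500))

# MB per million docs is the same as bytes per doc (integer division).
docs_millions=14
per_doc_bytes=$((extra_mb / docs_millions))

echo "expected stored bytes: $expected_bytes"
echo "observed growth: ${extra_mb} MB, roughly ${per_doc_bytes} bytes per doc"
```

By these numbers the growth works out to several hundred bytes for every doc in the index, which is far more than a per-doc docValues overhead for a mostly-empty field would normally explain.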
: - The problematic field is created through the API as follows:
:
: curl -X POST -H 'Content-type:application/json' --data-binary '{
: "add-field":{
: "name":"buildingName",
: "type":"string",
: "stored":true,
: "indexed":false
: }
: }' http://localhost:8983/solr/address/schema
...that's going to cause the field to inherit any (non-overridden)
settings from the fieldType "string" -- and in the 7.1 _default configset,
"string" is defined with docValues="true".
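If docValues does turn out to be inherited from the fieldType, one option is to override it explicitly on the field. A minimal sketch, assuming the same "address" collection and field definition as in the quoted request, and using the Schema API's replace-field command (which expects the full field definition, not just the changed property):

```shell
# Assumption: same host, collection ("address") and field ("buildingName")
# as in the quoted add-field request; only "docValues":false is new.
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
    "name":"buildingName",
    "type":"string",
    "stored":true,
    "indexed":false,
    "docValues":false
  }
}' http://localhost:8983/solr/address/schema
```

Existing documents would need to be reindexed after a change like this before the index size reflects it.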
You can see *all* properties set on a field -- regardless of whether they
are set on the fieldType or are implicit hardcoded defaults in the
implementation of the fieldType -- by using the 'showDefaults=true'
Schema API option.
Consider these API examples from the techproducts demo...
$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "multiValued":true,
    "indexed":true,
    "stored":true}}
$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat?showDefaults=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "indexed":true,
    "stored":true,
    "docValues":false,
    "termVectors":false,
    "termPositions":false,
    "termOffsets":false,
    "termPayloads":false,
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "omitPositions":false,
    "storeOffsetsWithPositions":false,
    "multiValued":true,
    "large":false,
    "sortMissingLast":true,
    "required":false,
    "tokenized":false,
    "useDocValuesAsStored":true}}
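Applied to the field from the question, the same check might look like the sketch below. The host, collection name ("address") and field name ("buildingName") are taken from the quoted add-field request and are assumptions about the actual setup.

```shell
# Assumption: collection "address" and field "buildingName", as in the
# quoted request. Shows every effective property, including inherited ones.
curl 'http://localhost:8983/solr/address/schema/fields/buildingName?showDefaults=true'
```

If the response includes "docValues":true, the field is picking that setting up from the "string" fieldType.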
-Hoss
http://www.lucidworks.com/