[
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975183#comment-13975183
]
Uwe Schindler commented on LUCENE-5609:
---------------------------------------
bq. Have a look at LUCENE-1470, even 2 was considered then.
That was not really usable even at that time! The improvement compared to 4 was
zero; it was even worse, because the term dictionary got larger, which had an
impact in 2.x and 3.x. At that time I was always using 8 as the precisionStep
for longs and ints. The same applied to Solr; Lucene was the only one using 4
as the default, and ElasticSearch was cloning Lucene's defaults.
I would really prefer to use 8 for both ints and longs. The change from 8 to 16
increases the number of terms immensely, while the index size difference
between 8 and 16 is not really a problem. It has also been my experience that,
because of the way floats/doubles are encoded, a precision step of 8 is really
good for longs: in most cases some parts of the value (like the exponent) never
change, so exactly one term is indexed for them.
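The exponent point can be sketched with a tiny standalone example (the helper
below mirrors the sortable-bits transform that NumericUtils applies to doubles;
the class name and sample values are made up for illustration): doubles of
similar magnitude share their sign and exponent bits, so their high-shift trie
terms collapse to one shared term.

```java
// Hypothetical demo class, not part of Lucene.
public class DoubleTermSharing {
    // Sketch of the transform Lucene's NumericUtils.doubleToSortableLong
    // performs, reproduced here so the example is self-contained.
    static long doubleToSortableLong(double val) {
        long f = Double.doubleToLongBits(val);
        if (f < 0) f ^= 0x7fffffffffffffffL; // flip payload bits of negatives
        return f;
    }

    public static void main(String[] args) {
        // 150.0 and 180.0 have the same sign and IEEE-754 exponent, so the
        // top byte of their sortable encodings (the shift-56 trie term for
        // precisionStep 8) is identical: one indexed term covers both.
        long a = doubleToSortableLong(150.0);
        long b = doubleToSortableLong(180.0);
        System.out.println((a >>> 56) == (b >>> 56)); // true
    }
}
```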
With a precision step of 16, I would imagine the difference between 16 and 64
would be negligible, too :-) The main reason for lower precision steps are
indexes where the values are equally distributed. For values clustered around a
few numbers, the precisionStep is irrelevant! Because of the way the encoding
works, for larger shifts the indexed value is constant, so you have one or two
terms that hit all documents and are never used by the range query.
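A minimal sketch of why clustered values collapse at higher shifts (hypothetical
class, not a Lucene API - a trie term at shift s is essentially the value with
its lowest s bits dropped):

```java
// Hypothetical demo class, not part of Lucene.
public class ClusteredValues {
    // The trie term at a given shift is the value with the low bits dropped.
    static long trieTerm(long value, int shift) {
        return value >>> shift;
    }

    public static void main(String[] args) {
        long a = 1_000_000L; // two values clustered close together
        long b = 1_000_500L;
        // At shift 0 the terms differ, but at shift 16 both values map to
        // the same term - that single term matches every document in the
        // cluster, so a smaller precisionStep buys nothing here.
        System.out.println(trieTerm(a, 0) == trieTerm(b, 0));   // false
        System.out.println(trieTerm(a, 16) == trieTerm(b, 16)); // true
    }
}
```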
So before changing the default, I would suggest running a test with an index
that has equally distributed numbers over the full 64-bit range.
bq. I think 11 is better than 12
...because the last term is better used. The number of terms indexed is the
same for 11 and 12 (6*11=66 and 6*12=72 both cover the 64 bits, while 5*12=60
is too small). But unfortunately 11 is not a multiple of 4, so it would not be
backwards compatible.
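The arithmetic behind this, as a small standalone sketch (assuming terms per
value = ceil(64/precisionStep); the class name is made up):

```java
// Hypothetical demo class, not part of Lucene.
public class StepArithmetic {
    // Number of trie terms needed to cover `bits` bits: ceil(bits / step).
    static int termsNeeded(int bits, int step) {
        return (bits + step - 1) / step; // ceiling division
    }

    // Bits of the last term that go beyond the value width (wasted capacity).
    static int unusedBits(int bits, int step) {
        return termsNeeded(bits, step) * step - bits;
    }

    public static void main(String[] args) {
        // Steps 11 and 12 both need 6 terms for a 64-bit long...
        System.out.println(termsNeeded(64, 11)); // 6 (6*11 = 66 >= 64)
        System.out.println(termsNeeded(64, 12)); // 6 (5*12 = 60 < 64)
        // ...but 11 wastes fewer bits, so its last term is "better used".
        System.out.println(unusedBits(64, 11));  // 2
        System.out.println(unusedBits(64, 12));  // 8
    }
}
```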
I think the main problem of this issue is that we only have *one* default.
Somebody who never does any range queries does not need the additional terms at
all. That's the main problem. Solr is better here, as it provides two
predefined field types, but Lucene only has one - and that is the bug.
So my proposal: provide a second field type as a second default, with correct
documentation, suggesting it to users who only want to index numeric
identifiers, or non-docvalues fields they want to sort on.
And second, we should do LUCENE-5605 - I started on it last week, but was
interrupted by _NativeFSIndexCorrumpter_ :-) The problem is the precisionStep
altogether! We should make it an implementation detail. When constructing an
NRQ, you should not need to pass it. Because of this I opened LUCENE-5605, so
anybody creating an NRQ/NRF should pass the FieldType to the NRQ ctor, not an
arbitrary number. Then it is ensured that people use the same settings for
indexing and querying.
Together with this, we should provide two predefined field types per data type
and remove the constant from NumericUtils completely. The two field types per
data type might be named something like DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE
and DEFAULT_INT_OTHERWISE_FIELDTYPE (please choose better names and javadocs).
And we should make 8 the new default, which is fully backwards compatible. And
hide the precision step completely! 16 is really too large for lots of queries,
and the difference in index size is negligible, unless you have a purely
numeric index (in which case you should use an RDBMS instead of a Lucene index
to query your data :-) !). Indexing time is also, as Mike discovered, not a
problem at all: if people don't reuse the IntField instance, it is always
equally slow, because the TokenStream has to be recreated for every number. The
number of terms is not the issue at all, sorry!
About ElasticSearch: unfortunately, the schemaless mode of ElasticSearch always
uses 4 as the precisionStep if it detects a numeric or date type. ES should
change this, but maybe with a bit more intelligent "guessing". E.g., if you
index the "_id" field as an integer, it should automatically use an infinite
(DEFAULT_INT_OTHERWISE_FIELDTYPE) precisionStep - nobody would do range queries
on the "_id" field. For all standard numeric fields it should use
precisionStep=8.
> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
> Key: LUCENE-5609
> URL: https://issues.apache.org/jira/browse/LUCENE-5609
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Michael McCandless
> Fix For: 4.9, 5.0
>
> Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache). And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
> * lat/lng (double)
> * modified time, elevation, population (long)
> * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
>   PrecStep       Size   IndexTime
>          4  1812.7 MB   651.4 sec
>          8  1203.0 MB   443.2 sec
>         16   894.3 MB   361.6 sec
>
> searching:
>   Field      PrecStep  QueryTime  TermCount
>   geoNameID         4  2872.5 ms      20306
>   geoNameID         8  2903.3 ms     104856
>   geoNameID        16  3371.9 ms    5871427
>   latitude          4  2160.1 ms      36805
>   latitude          8  2249.0 ms     240655
>   latitude         16  2725.9 ms    4649273
>   modified          4  2038.3 ms      13311
>   modified          8  2029.6 ms      58344
>   modified         16  2060.5 ms      77763
>   longitude         4  3468.5 ms      33818
>   longitude         8  3629.9 ms     214863
>   longitude        16  4060.9 ms    4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations. TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16? Or both to 16?
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]