[
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975111#comment-13975111
]
Paul Elschot commented on LUCENE-5609:
--------------------------------------
Going from 4 to 16 for the 64-bit types is a very large step.
Wouldn't it be better to do that in more steps, and only take a step from 4 to 8
now?
For the 32-bit types, I think a precision step of 11 is better than 12.
Both have an indexing cost of 3 indexed terms for 32 bits (10/11/11 and 8/12/12
precision bits per term).
11 should be faster at searching because it involves fewer terms. For a
single-ended range, the expected number of terms for these cases is about half of:
{code} (2**10 + 2**11 + 2**11) < (2**8 + 2**12 + 2**12) {code}
Whether that difference is actually noticeable remains to be seen.
Independent of the precision step, geohashes from the spatial module might help
avoid range subqueries that produce large result sets.
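For illustration, the arithmetic behind the 11-vs-12 comparison can be sketched in plain code (this is not Lucene code; the helper names are made up for this sketch, which assumes the standard trie encoding where each value is indexed at progressively coarser precisions):

```python
def precision_bits(total_bits, step):
    """Bits encoded per trie level, highest level first, e.g. 32/11 -> [10, 11, 11]."""
    levels = -(-total_bits // step)          # ceil division = number of indexed terms
    top = total_bits - (levels - 1) * step   # the top level holds the leftover bits
    return [top] + [step] * (levels - 1)

def max_terms_single_ended(total_bits, step):
    """Upper bound on terms a single-ended range visits: 2**b per level.
    The expected count is about half of this bound."""
    return sum(2 ** b for b in precision_bits(total_bits, step))

print(precision_bits(32, 11))            # [10, 11, 11] -> 3 indexed terms
print(precision_bits(32, 12))            # [8, 12, 12]  -> also 3 indexed terms
print(max_terms_single_ended(32, 11))    # 2**10 + 2**11 + 2**11 = 5120
print(max_terms_single_ended(32, 12))    # 2**8 + 2**12 + 2**12 = 8448
```

Both steps cost 3 indexed terms per 32-bit value, but 11 bounds a single-ended range at 5120 candidate terms versus 8448 for 12.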
> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
> Key: LUCENE-5609
> URL: https://issues.apache.org/jira/browse/LUCENE-5609
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Michael McCandless
> Fix For: 4.9, 5.0
>
> Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8-byte (long/double) and 4-byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache). And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
> * lat/lng (double)
> * modified time, elevation, population (long)
> * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
>
> PrecStep       Size  IndexTime
>        4  1812.7 MB  651.4 sec
>        8  1203.0 MB  443.2 sec
>       16   894.3 MB  361.6 sec
>
> searching:
>
> Field      PrecStep  QueryTime  TermCount
> geoNameID         4  2872.5 ms      20306
> geoNameID         8  2903.3 ms     104856
> geoNameID        16  3371.9 ms    5871427
> latitude          4  2160.1 ms      36805
> latitude          8  2249.0 ms     240655
> latitude         16  2725.9 ms    4649273
> modified          4  2038.3 ms      13311
> modified          8  2029.6 ms      58344
> modified         16  2060.5 ms      77763
> longitude         4  3468.5 ms      33818
> longitude         8  3629.9 ms     214863
> longitude        16  4060.9 ms    4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is the time to run 100 random ranges for that field,
> averaged over 20 iterations. TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments; as expected it
> grows as precStep grows, but the search time is not that heavily
> impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16? Or both to 16?
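The per-value indexing cost discussed in the quoted description follows directly from the precision step: each numeric value is indexed as ceil(bits / precStep) terms. A small sketch (plain arithmetic, not the Lucene API) of the candidate defaults:

```python
def terms_per_value(bits, step):
    """Number of trie terms indexed per numeric value: ceil(bits / step)."""
    return -(-bits // step)  # ceil division

# precStep 4 is the current default: 8 terms per int/float, 16 per long/double.
for step in (4, 8, 16):
    print(step, terms_per_value(32, step), terms_per_value(64, step))
```

This prints 8/16 terms per value at step 4, 4/8 at step 8, and 2/4 at step 16, which is why raising the default shrinks the index so noticeably for tiny documents.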
--
This message was sent by Atlassian JIRA
(v6.2#6252)