[ 
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973001#comment-13973001
 ] 

Michael McCandless commented on LUCENE-5609:
--------------------------------------------

Another test, this time on a biggish (~1B docs) simulated timestamps
over a 1 day range with msec precision (benchmark sources are in
luceneutil):

{noformat}
indexing:

PrecStep        Size    IndexTime
       4      6.7 GB     2482 sec
       8      4.0 GB     2229 sec
      16      2.8 GB     2213 sec
{noformat}

Curiously index speed did not change much (this test only indexes the
one LongField), but I did re-use the LongField instance across
addDocument calls, and I used 12 indexing threads.

{noformat}
searching:

PrecStep     QueryTime         TermCount
       4      29.7 sec        1285 terms
       8      29.9 sec       11216 terms
      16      30.6 sec     1410453 terms
{noformat}

Much smaller slowdown in the queries as precStep is increased ... I
suspect because the numbers are so "dense" and many docs have the
timestamp, but still a big reduction in index size.

Net/net I think we should increase the default precStep?


> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to