[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759449#comment-13759449
 ] 

Michael McCandless commented on LUCENE-3069:
--------------------------------------------

Patch looks great.  It's nice how postings writers no longer need
their own redundant PendingTerm instances to track the term's metadata
/ blocking; just use their existing TermState class instead.  And how
postings readers don't have to deal w/ blocking either.

In general, couldn't the writer re-use the reader's TermState?
E.g. Lucene40PostingsWriter could just use Lucene40PostingsReader's
StandardTermState, rather than making its own?  (And the same for
Lucene41PostingsWriter/Reader.)

Have you run "first do no harm" perf tests?  I.e., compare current
trunk w/ default Codec to the branch w/ default Codec?  Just to make
sure there are no surprises...

Why does Lucene41PostingsWriter have "impersonation" code?  Was that
just for debugging during dev?  Can we remove it (the writer should
always write the current format)?  The reader needs it of course ...
but there it should be commented as back-compat, not "impersonation"?

In the javadocs for encodeTerm, don't we require that the long[]
values are always monotonic?  It's not "optional"?  Also,
"monotonical" should be "monotonic" there.

Maybe we should add a "reset" method to each PF's TermState, so
instead of doing newTermState() when absolute, we can .reset(), and
likewise in the reader.
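
A minimal sketch of the reset idea, not taken from the patch:
SketchTermState and SketchTermWriter are hypothetical stand-ins, and
the field names are only illustrative.  The point is that the "last
term" state is cleared in place when a term is absolute, instead of
allocating a fresh instance.

```java
// Hypothetical stand-in for a postings format's per-term state;
// this is NOT Lucene's actual TermState class.
final class SketchTermState {
    long docStartFP;  // illustrative: file pointer into the doc file
    long posStartFP;  // illustrative: file pointer into the positions file
    int docFreq;      // illustrative: number of docs containing the term

    // Clear the state in place so the same instance can be reused.
    void reset() {
        docStartFP = 0L;
        posStartFP = 0L;
        docFreq = 0;
    }
}

// Hypothetical writer that keeps one reusable "last term" state.
final class SketchTermWriter {
    private final SketchTermState last = new SketchTermState();

    // Returns the docStartFP value to write for this term: the full
    // value when absolute is true, otherwise the delta against the
    // previously encoded term.
    long encode(SketchTermState current, boolean absolute) {
        if (absolute) {
            last.reset();  // instead of allocating a new state object
        }
        long toWrite = current.docStartFP - last.docStartFP;
        last.docStartFP = current.docStartFP;
        last.posStartFP = current.posStartFP;
        last.docFreq = current.docFreq;
        return toWrite;
    }
}
```

The reader side would presumably do the mirror image: reset its own
reusable state when absolute is true, then add each decoded delta.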

I forget: why does the postings reader/writer need to handle delta
coding again (take an absolute boolean argument)?  Was it because of
pulsing or sep?  It's fine for now (progress not perfection) ... but
not clean, since "delta coding" is really an encoding detail so in
theory the terms dict should "own" that ...
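
To make "the terms dict owns delta coding" concrete, here is a rough
sketch under stated assumptions: DeltaCodingSketch and its methods are
hypothetical, every term in a block is assumed to supply a long[] of
the same length, and each slot is assumed to be monotonic
(non-decreasing) from term to term.  The postings writer would only
hand over absolute values; the delta step lives in one place.

```java
// Hypothetical helper, not part of Lucene: delta-codes per-term long[]
// metadata inside the terms dictionary.
final class DeltaCodingSketch {

    // Encode a block of terms.  The first term is stored as-is
    // (absolute); each later term is stored as the per-slot difference
    // from the previous term.
    static long[][] encodeBlock(long[][] absoluteLongs) {
        long[][] out = new long[absoluteLongs.length][];
        long[] prev = null;
        for (int i = 0; i < absoluteLongs.length; i++) {
            long[] cur = absoluteLongs[i];
            long[] enc = new long[cur.length];
            for (int j = 0; j < cur.length; j++) {
                long base = (prev == null) ? 0L : prev[j];
                if (cur[j] < base) {
                    // Monotonicity is what keeps every delta non-negative.
                    throw new IllegalArgumentException(
                        "slot " + j + " of term " + i + " is not monotonic");
                }
                enc[j] = cur[j] - base;
            }
            out[i] = enc;
            prev = cur;
        }
        return out;
    }

    // Inverse of encodeBlock: rebuild absolute values by running sums.
    static long[][] decodeBlock(long[][] deltas) {
        long[][] out = new long[deltas.length][];
        long[] prev = null;
        for (int i = 0; i < deltas.length; i++) {
            long[] dec = new long[deltas[i].length];
            for (int j = 0; j < dec.length; j++) {
                dec[j] = deltas[i][j] + ((prev == null) ? 0L : prev[j]);
            }
            out[i] = dec;
            prev = dec;
        }
        return out;
    }
}
```

With this split the absolute flag would not need to reach the postings
reader/writer at all, though it does not address whatever pulsing or
sep needed it for.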

"monotonical" appears several times but I think it should instead be
"monotonic".

The new .smy file for Pulsing is sort of strange ... but necessary,
since Pulsing always uses 0 longs and we have to store that somewhere
... could you put it into FieldInfo attributes instead?

It's nice how small the FST terms dicts are!  Much simpler than the
hairy BlockTree code...

Should we backport this to 4.x?  In theory this should not be so hard
... 3.x indices already have their own PF impls, and the change is
back-compatible to current 4.x indices ...

                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

