On Tue, Oct 15, 2013 at 10:11 AM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> Well, unfortunately, this is a trap that users do hit.
>>
>> By requiring the user to think about the limit on creating
>> PostingsHighlighter, he/she would think about it and realize they are
>> in fact setting a limit.
>>
>> Silent limits are dangerous because you don't offhand know what's
>> wrong / why you see nothing getting highlighted.
>>
>>
>
> I already made my argument: for 99% of use cases the defaults are
> fine. In most cases highlighting is trying to summarize the document
> and something that deep just doesn't contribute much (see the default
> scoring model!). There is an optional ctor for the others doing expert
> things to specify the length.
>
> I don't think we should make APIs unusable because you think XYZ is a trap.

How would this make the APIs unusable?

I don't think requiring the user to set the truncation limit (a single
int parameter) up front makes the API "unusable".

Instead, it's making it clear that this class silently discards tokens
from the document, which I think is dangerous for any class to
silently do.  The user needs to think about what to pass, and realize
what they pass means truncation is happening.
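
To make that concrete, here is a minimal sketch of the idea.  Note that
SketchHighlighter and matchOffsets are hypothetical names made up for
this example, not the real PostingsHighlighter API; the only point is
that with no no-arg constructor, the caller cannot miss the limit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a highlighter; NOT the real PostingsHighlighter.
final class SketchHighlighter {
    // Number of leading characters of each document that will be examined.
    private final int maxLength;

    // There is deliberately no no-arg constructor: the caller must pick the limit.
    SketchHighlighter(int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        this.maxLength = maxLength;
    }

    // Returns the start offsets of term that fall entirely within the first
    // maxLength characters of content; matches past the limit are discarded.
    List<Integer> matchOffsets(String content, String term) {
        List<Integer> offsets = new ArrayList<>();
        if (term.isEmpty()) {
            return offsets;
        }
        int limit = Math.min(content.length(), maxLength);
        int from = 0;
        while (true) {
            int idx = content.indexOf(term, from);
            if (idx < 0 || idx + term.length() > limit) {
                break;
            }
            offsets.add(idx);
            from = idx + term.length();
        }
        return offsets;
    }
}
```

With a small limit, a later occurrence of the term is simply not
reported, which is exactly the behavior a user should have chosen
knowingly rather than discovered by accident.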

> Why not make DEFAULT_MAX_THREAD_STATES a required parameter to IndexWriter?

I think that's quite different: that param is for optimizing how many
threads can run concurrently in IndexWriter, and there are lots of
other parameters you could tune if you want to try to speed things up.
It's not about discarding tokens, which is a change in functionality
and a very different matter.

Long ago, IndexWriter used to do something very similar: it would
silently discard all tokens after the first 10,000 by default.  But
that was horribly trappy, and so we made it a required ctor parameter.
Now, finally, we've removed it entirely and you can use
LimitTokenCountFilter if you want to truncate before indexing.
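
As a rough illustration of what opting in to truncation means, here is
a plain-Java sketch.  TokenLimitSketch and limitTokens are hypothetical
names for this example only; the real LimitTokenCountFilter wraps a
TokenStream in the analysis chain rather than operating on a List:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of explicit token truncation; it does not use Lucene.
final class TokenLimitSketch {
    // Keeps at most maxTokenCount leading tokens and drops the rest.
    static List<String> limitTokens(List<String> tokens, int maxTokenCount) {
        if (maxTokenCount < 0) {
            throw new IllegalArgumentException("maxTokenCount must be >= 0");
        }
        return tokens.subList(0, Math.min(tokens.size(), maxTokenCount));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("silent limits are dangerous".split(" "));
        // The caller asked for truncation, so dropping tokens here is no surprise.
        System.out.println(limitTokens(tokens, 2));
    }
}
```

The difference from the old IndexWriter default is only in who makes
the decision: here nothing is dropped unless the caller writes the
limit down.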

> Hell, let's make it so users have to supply all parameters to
> everything, so everything is like
> IndexWriter(int,int,int,int,int,int,int,int,int,int,int,int) and so
> on. Then you will be satisfied there are no traps, but it will be
> totally unusable.

I agree that would be unusable, but that's not what I'm proposing;
it's not so black and white.

I do agree with you that we need to keep our APIs very minimal, and
that every added parameter is an added cost.  But we need to balance
that with settings that do nasty things, like truncate tokens; I think
it's fair in such cases to consider making them an explicit choice in
the API.

Mike McCandless

http://blog.mikemccandless.com

