If you'd like to join in on the doc, see https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant you access to push to my fork.
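To make the second-document failure described in the quoted message below concrete, here is a minimal sketch in plain Java. Assumptions: SketchTokenizer, WhitespaceSketchTokenizer, DelegatingTokenizer and the rewireOnReset flag are hypothetical stand-ins invented for illustration; they are not Lucene classes and do not reproduce Lucene's real state machine. The only behavior borrowed from the thread is that close() swaps the reader for one that throws (the role ILLEGAL_STATE_READER plays), so a delegate that is not handed the new reader in reset() should fail on reuse.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the reuse contract; NOT Lucene's Tokenizer.
abstract class SketchTokenizer {
    // Rejects every read, playing the role of ILLEGAL_STATE_READER.
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("read() after close() without setReader()");
        }
        @Override public void close() {}
    };

    protected Reader input;

    protected SketchTokenizer(Reader input) { this.input = input; }

    public void setReader(Reader input) { this.input = input; }

    public void reset() throws IOException {}

    public void close() throws IOException {
        input.close();
        input = ILLEGAL_STATE_READER; // reuse now requires a fresh setReader()
    }

    /** Returns the next whitespace-separated token, or null at end of input. */
    public abstract String next() throws IOException;
}

class WhitespaceSketchTokenizer extends SketchTokenizer {
    WhitespaceSketchTokenizer(Reader input) { super(input); }

    @Override public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) break; // token finished
            } else {
                sb.append((char) c);
            }
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}

class DelegatingTokenizer extends SketchTokenizer {
    private final SketchTokenizer delegate;
    private final boolean rewireOnReset; // hypothetical switch: false = broken variant

    DelegatingTokenizer(Reader input, boolean rewireOnReset) {
        super(input);
        // Hand over this.input, not the constructor argument.
        this.delegate = new WhitespaceSketchTokenizer(this.input);
        this.rewireOnReset = rewireOnReset;
    }

    @Override public void reset() throws IOException {
        super.reset();                                 // super first
        if (rewireOnReset) delegate.setReader(input);  // the fix from the thread
        delegate.reset();
    }

    @Override public void close() throws IOException {
        super.close();
        delegate.close();
    }

    @Override public String next() throws IOException { return delegate.next(); }
}

class DelegationSketch {
    /** Feeds two documents through one reused tokenizer; returns tokens joined by spaces. */
    static String tokenizeTwice(boolean rewireOnReset) {
        try {
            DelegatingTokenizer t =
                new DelegatingTokenizer(new StringReader("First Doc"), rewireOnReset);
            StringBuilder out = new StringBuilder();
            for (int doc = 0; doc < 2; doc++) {
                if (doc == 1) t.setReader(new StringReader("Second Doc"));
                t.reset();
                for (String tok = t.next(); tok != null; tok = t.next()) {
                    out.append(tok).append(' ');
                }
                t.close();
            }
            return out.toString().trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeTwice(true)); // prints "First Doc Second Doc"
        try {
            tokenizeTwice(false);
        } catch (IllegalStateException e) {
            System.out.println("without setReader() in reset(): " + e.getMessage());
        }
    }
}
```

In this sketch the broken variant gets through the first document and only throws on the second, which matches the "they only test one time" trap mentioned further down the thread.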
On Wed, Jan 8, 2014 at 5:37 AM, Mindaugas Žakšauskas <min...@gmail.com> wrote:
> Just for the interest, I had a similar problem too as well as other
> people [1]. In my project, I am extending the Tokenizer class and have
> another tokenizer (e.g. ClassicTokenizer) as a delegate.
> Unfortunately, properly overriding all public/protected methods is
> *not* enough, e.g.:
>
>     public void reset() throws IOException {
>         super.reset();
>         delegate.reset();
>     }
>
> I was still getting the exception of broken read()/close() contract.
> Half day and *lots* of debugging later, I realized that exception is
> only thrown when indexing second document only as the delegate reader
> internally gets replaced with ILLEGAL_STATE_READER after .close() is
> called. My solution to this problem was to make the reset() method
> like this:
>
>     public void reset() throws IOException {
>         super.reset();
>         delegate.setReader(input);
>         delegate.reset();
>     }
>
> Another thing worth mentioning is that it's crucial to have
> super.method() before delegate.method() in all overridden methods.
> Would be nice if all of this was somewhere in the Tokenizer Javadoc,
> or even nicer if the base class was designed with delegation in mind
> (Effective Java (2nd edition), Item 16).
>
> Hope this helps somebody.
>
> [1]
> http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673
>
> Regards,
> Mindaugas
>
> On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies <ben...@basistech.com>
> wrote:
> > Yes I Do.
> >
> >
> > On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir <rcm...@gmail.com> wrote:
> >
> >> Benson, do you want to open an issue to fix this constructor to not
> >> take Reader? (there might be one already, but lets make a new one).
> >>
> >> These things are supposed to be reused, and have setReader for that
> >> purpose. i think its confusing and contributes to bugs that you have
> >> to have logic in e.g. the ctor THEN ALSO in reset().
> >>
> >> if someone does it correctly in the ctor, but they only test "one
> >> time", they might think everything is working..
> >>
> >> On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies <ben...@basistech.com>
> >> wrote:
> >> > For the record of other people who implement tokenizers:
> >> >
> >> > Say that your tokenizer has a constructor, like:
> >> >
> >> >     public MyTokenizer(Reader reader, ....) {
> >> >         super(reader);
> >> >         myWrappedInputDevice = new MyWrappedInputDevice(reader);
> >> >     }
> >> >
> >> > Not a good idea. Tokenizer carefully manages the data flow from the
> >> > constructor arg to the 'input' field. The correct form is:
> >> >
> >> >     public MyTokenizer(Reader reader, ....) {
> >> >         super(reader);
> >> >         myWrappedInputDevice = new MyWrappedInputDevice(this.input);
> >> >     }
> >> >
> >> > On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir <rcm...@gmail.com> wrote:
> >> >
> >> >> See Tokenizer.java for the state machine logic. In general you should
> >> >> not have to do anything if the tokenizer is well-behaved (e.g. close
> >> >> calls super.close() and so on).
> >> >>
> >> >> On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies <bimargul...@gmail.com>
> >> >> wrote:
> >> >> > In 4.6.0,
> >> >> > org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
> >> >> >
> >> >> > fails if incrementToken fails to throw if there's a missing reset.
> >> >> >
> >> >> > How am I supposed to organize this in a Tokenizer? A quick look at
> >> >> > CharTokenizer did not reveal any code for the purpose.
> >> >> >
> >> >> > ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org