Hi Ziqi,

Lucene has support for sentence chunking - see SegmentingTokenizerBase, which is extended by ThaiTokenizer and HMMChineseTokenizer. There is an example in SegmentingTokenizerBase's tests that creates tokens out of individual sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer.
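
For reference, SegmentingTokenizerBase finds its sentence boundaries with a java.text.BreakIterator and hands each sentence to the subclass. Here is a minimal JDK-only sketch of that boundary detection. The class name SentenceSplitter, the split method, and the sample text are made up for illustration and are not part of Lucene; the trimming of whitespace is my own choice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: splits text into sentences
// using only the JDK's sentence BreakIterator.
final class SentenceSplitter {

    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each next() call returns the end offset of the current sentence,
        // or BreakIterator.DONE when the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Trim the whitespace that follows the sentence terminator.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It is written in Java. Is it fast?";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

The rule-based iterator can mis-split around abbreviations, so treat the output as approximate.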
However, it sounds like you only need to store the sentences, not search against them, so I don't think you need sentence *tokenization*. Why not simply use the JDK's BreakIterator (or, as you say, OpenNLP) to do sentence splitting and add the sentences to the doc as stored fields?

Steve
www.lucidworks.com

> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>
> Thanks that is understood.
>
> My application is a bit special in the way that I need both an indexed field
> with standard tokenization and an unindexed but stored field of sentences.
> Both must be present for each document.
>
> I could possibly do with PatternTokenizer, but that is of course, less
> accurate than e.g., wrapping OpenNLP sentence splitter in a lucene Tokenizer.
>
>
>
> On 23/09/2015 16:23, Doug Turnbull wrote:
>> Sentence recognition is usually an NLP problem. Probably best handled
>> outside of Solr. For example, you probably want to train and run a sentence
>> recognition algorithm, inject a sentence delimiter, then use that delimiter
>> as the basis for tokenization.
>>
>> More info on sentence recognition
>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>
>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
>> wrote:
>>
>>> Hi
>>>
>>> I need a special kind of 'token' which is a sentence, so I need a
>>> tokenizer that splits texts into sentences.
>>>
>>> I wonder if there is already such or similar implementations?
>>>
>>> If I have to implement it myself, I suppose I need to implement a subclass
>>> of Tokenizer. Having looked at a few existing implementations, it does not
>>> look very straightforward how to do it. A few pointers would be highly
>>> appreciated.
>>>
>>> Many thanks
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>
> --
> Ziqi Zhang
> Research Associate
> Department of Computer Science
> University of Sheffield