Hmm, reading this: "*unindexed* but stored field of sentences".

"Unindexed" immediately tells me that you do not actually need a tokeniser at all. Just run an external sentence splitter (in your indexing application) and store the sentences as separate values of a stored field.

Why would this not work for you?
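To make that concrete, here is a rough sketch of what I mean, not a definitive implementation. It uses only the JDK's BreakIterator (which Steve mentioned) and encodes each sentence with its offsets in the "start,end|sentence" style of Steve's option 3. The class name, the method name and the choice of an exclusive end offset are my own assumptions, so adapt them to your app.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits text into sentences
// outside of any analysis chain, in the indexing application itself.
class SentenceSplitSketch {

    /**
     * Returns one string per sentence, encoded as "start,end|sentence".
     * start is inclusive and end is exclusive (an assumption of this sketch);
     * both are char offsets into the original text.
     */
    static List<String> splitWithOffsets(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> values = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // BreakIterator keeps whitespace between sentences in the span,
            // so narrow the span to the sentence text itself.
            int s = start;
            int e = end;
            while (e > s && Character.isWhitespace(text.charAt(e - 1))) {
                e--;
            }
            while (s < e && Character.isWhitespace(text.charAt(s))) {
                s++;
            }
            if (s == e) {
                continue; // whitespace-only span, nothing to store
            }
            values.add(s + "," + e + "|" + text.substring(s, e));
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : splitWithOffsets("Something happened. It was fine.")) {
            // In the indexing code, each value would become one stored-only
            // field instance on the document, e.g. with Lucene's
            // new StoredField("sentences", value); the field name is made up.
            System.out.println(value);
        }
    }
}
```

The same document can still carry the normally tokenized, indexed field for the full text, so both fields live in one index and no custom Tokenizer is involved.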
Cheers

2015-09-24 9:39 GMT+01:00 Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>:

> Thanks for the comprehensive explanation, I think option 3 best fits my app.
>
> On 23/09/2015 22:53, Steve Rowe wrote:
>
>> Unless you need to be able to search on sentences-as-terms, i.e. exact
>> sentence matching, you should try to find an alternative; otherwise your
>> term index will be unnecessarily huge.
>>
>> Three things come to mind:
>>
>> 1. A single Lucene index can host mixed document types, e.g. full
>> documents and sentences.
>>
>> 2. Nested documents, in Lucene's join module, could help, depending on
>> what you need to do. Parent documents could correspond to original full
>> documents, and sentences could be stored fields in child documents. The
>> sentence offsets could be separate child document fields, maybe also
>> stored-only, depending on search/sort requirements. See
>> <http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>,
>> <http://lucene.apache.org/core/5_3_0/join/org/apache/lucene/search/join/ToParentBlockJoinQuery.html>,
>> and the tests for ToParentBlockJoinQuery for example usages:
>> <http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_3_0/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java>.
>>
>> 3. Or, more simply, just store the offsets as a prefix inline with the
>> stored sentence field values, e.g.
>>
>>     original text: Four score and seven years ago.... Something happened.
>>     words (indexed): Four, score, and, seven, years, ago, Something, happened
>>     sentences (stored): 0,31|Four score and seven years ago....
>>     sentences (stored): 33,52|Something happened.
>>
>> Steve
>> www.lucidworks.com
>>
>> On Sep 23, 2015, at 3:26 PM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>
>>> Thanks Steve.
>>>
>>> It probably also makes sense to extract sentences and then store them.
>>> But along with each sentence I also need to store its start/end offset.
>>> I'm not sure how to do that without creating a separate index that
>>> stores each sentence as a document? Basically the field for sentences
>>> and the field for terms should be in the same index.
>>>
>>> Thanks
>>>
>>> On 23/09/2015 19:08, Steve Rowe wrote:
>>>
>>>> Hi Ziqi,
>>>>
>>>> Lucene has support for sentence chunking - see SegmentingTokenizerBase,
>>>> implemented in ThaiTokenizer and HMMChineseTokenizer. There is an
>>>> example in that class’s tests that creates tokens out of individual
>>>> sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer.
>>>>
>>>> However, it sounds like you only need to store the sentences, not
>>>> search against them, so I don’t think you need sentence *tokenization*.
>>>>
>>>> Why not simply use the JDK’s BreakIterator (or, as you say, OpenNLP) to
>>>> do sentence splitting and add the sentences to the doc as stored fields?
>>>>
>>>> Steve
>>>> www.lucidworks.com
>>>>
>>>> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
>>>>
>>>>> Thanks, that is understood.
>>>>>
>>>>> My application is a bit special in the way that I need both an indexed
>>>>> field with standard tokenization and an unindexed but stored field of
>>>>> sentences. Both must be present for each document.
>>>>>
>>>>> I could possibly do with PatternTokenizer, but that is, of course, less
>>>>> accurate than e.g. wrapping the OpenNLP sentence splitter in a Lucene
>>>>> Tokenizer.
>>>>>
>>>>> On 23/09/2015 16:23, Doug Turnbull wrote:
>>>>>
>>>>>> Sentence recognition is usually an NLP problem. Probably best handled
>>>>>> outside of Solr. For example, you probably want to train and run a
>>>>>> sentence recognition algorithm, inject a sentence delimiter, then use
>>>>>> that delimiter as the basis for tokenization.
>>>>>>
>>>>>> More info on sentence recognition:
>>>>>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang
>>>>>> <ziqi.zh...@sheffield.ac.uk> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I need a special kind of 'token' which is a sentence, so I need a
>>>>>>> tokenizer that splits texts into sentences.
>>>>>>>
>>>>>>> I wonder if there is already such or similar implementations?
>>>>>>>
>>>>>>> If I have to implement it myself, I suppose I need to implement a
>>>>>>> subclass of Tokenizer. Having looked at a few existing
>>>>>>> implementations, it does not look very straightforward how to do it.
>>>>>>> A few pointers would be highly appreciated.
>>>>>>>
>>>>>>> Many thanks
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>> --
>>>>> Ziqi Zhang
>>>>> Research Associate
>>>>> Department of Computer Science
>>>>> University of Sheffield

--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England