How can I use a byte array as a Lucene index field?

2015-09-23 Thread Larry White
I'm using Lucene 4.10.3. (I plan to upgrade soon but need to fix an issue on this version today). I switched a Lucene index from using string document ids to byte arrays. The problem I'm having is that the system no longer finds documents by their id. I *suspect* this is because the Lucene code is

tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Hi, I need a special kind of 'token', which is a sentence, so I need a tokenizer that splits text into sentences. I wonder if there are already such or similar implementations? If I have to implement it myself, I suppose I need to implement a subclass of Tokenizer. Having looked at a few exist

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Doug Turnbull
Sentence recognition is usually an NLP problem. Probably best handled outside of Solr. For example, you probably want to train and run a sentence recognition algorithm, inject a sentence delimiter, then use that delimiter as the basis for tokenization. More info on sentence recognition http://open
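
A minimal sketch of the "inject a delimiter, then tokenize on it" idea. It uses java.text.BreakIterator as a stand-in for a trained sentence recognizer (an assumption; a real NLP model such as an OpenNLP sentence detector would replace it), and the delimiter character and class name are made up for illustration:

    import java.text.BreakIterator;
    import java.util.Locale;

    // Sketch: split text into sentences with BreakIterator (a stand-in for a
    // trained sentence recognizer) and join them with a delimiter that a
    // downstream tokenizer can split on.
    public class SentenceDelimiterInjector {
        // Paragraph separator, assumed not to occur in the source data.
        private static final String DELIMITER = "\u2029";

        public static String injectDelimiters(String text) {
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
            it.setText(text);
            StringBuilder out = new StringBuilder();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                out.append(text, start, end).append(DELIMITER);
            }
            return out.toString();
        }
    }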

Re: How can I use a byte array as a Lucene index field?

2015-09-23 Thread Michael McCandless
You are indexing your field as a StoredField which means it's not actually indexed (just stored), so no query (nor IW.deleteDocument) will ever be able to find it. Try StringField instead ... in recent versions you can pass a BytesRef value to that. Mike McCandless http://blog.mikemccandless.com
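
A minimal sketch of what this suggests, assuming Lucene 5.3+ (per the follow-up in this thread), where StringField accepts a BytesRef value. The field name "id" is an assumption:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.util.BytesRef;

    // Sketch: index the byte-array id with StringField so it is indexed (and
    // stored), making it findable by TermQuery and IndexWriter.deleteDocuments.
    public class ByteArrayIdExample {
        static void addDoc(IndexWriter writer, byte[] id) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("id", new BytesRef(id), Field.Store.YES));
            writer.addDocument(doc);
        }

        static void deleteDoc(IndexWriter writer, byte[] id) throws Exception {
            writer.deleteDocuments(new Term("id", new BytesRef(id)));
        }
    }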

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Thanks, that is understood. My application is a bit special in that I need both an indexed field with standard tokenization and an unindexed but stored field of sentences. Both must be present for each document. I could possibly make do with PatternTokenizer, but that is, of course, less ac
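
For illustration, a crude sketch of the PatternTokenizer alternative, assuming Lucene 5.x constructors (no Reader argument). The regex is an assumption and will misfire on abbreviations, which is exactly the accuracy concern raised above:

    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.pattern.PatternTokenizer;

    // Sketch: approximate sentence splitting by splitting on sentence-ending
    // punctuation followed by whitespace (group = -1 means "split on the pattern").
    public class PatternSentenceTokenizerExample {
        public static PatternTokenizer create() {
            return new PatternTokenizer(Pattern.compile("(?<=[.!?])\\s+"), -1);
        }
    }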

Re: How can I use a byte array as a Lucene index field?

2015-09-23 Thread Larry White
I upgraded to 5.3 and fixed it as you suggested. Thank you. On Wed, Sep 23, 2015 at 11:31 AM, Michael McCandless <luc...@mikemccandless.com> wrote: > You are indexing your field as a StoredField which means it's not > actually indexed (just stored), so no query (nor IW.deleteDocument) > will ever b

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Steve Rowe
Hi Ziqi, Lucene has support for sentence chunking - see SegmentingTokenizerBase, implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example in that class’s tests that creates tokens out of individual sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer. However, it
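
A sketch of such a tokenizer, loosely modeled on the WholeSentenceTokenizer test class mentioned above. It assumes Lucene 5.x (Tokenizer constructors no longer take a Reader) and relies on SegmentingTokenizerBase's protected buffer/offset fields; treat it as an approximation, not the shipped test code:

    import java.text.BreakIterator;
    import java.util.Locale;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

    // Sketch: emit each sentence found by BreakIterator as a single token.
    public final class WholeSentenceTokenizer extends SegmentingTokenizerBase {
        private int sentenceStart, sentenceEnd;
        private boolean hasSentence;

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

        public WholeSentenceTokenizer() {
            super(BreakIterator.getSentenceInstance(Locale.ROOT));
        }

        @Override
        protected void setNextSentence(int sentenceStart, int sentenceEnd) {
            this.sentenceStart = sentenceStart;
            this.sentenceEnd = sentenceEnd;
            hasSentence = true;
        }

        @Override
        protected boolean incrementWord() {
            if (!hasSentence) {
                return false;
            }
            hasSentence = false;
            clearAttributes();
            // buffer and offset are protected fields of SegmentingTokenizerBase
            termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
            offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                                correctOffset(offset + sentenceEnd));
            return true;
        }
    }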

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Thanks, Steve. It probably also makes sense to extract sentences and then store them. But along with each sentence I also need to store its start/end offset. I'm not sure how to do that without creating a separate index that stores each sentence as a document? Basically the field for sentence a
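
One hypothetical way to keep the sentences on the same document, sketched here only as an illustration (the field name "sentences" and the "start:end:text" encoding are assumptions, not anything proposed in this thread): add one stored-only value per sentence and encode the offsets into the stored string.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StoredField;

    // Sketch: store each sentence with its character offsets on the same document.
    public class StoredSentenceExample {
        static void addSentence(Document doc, String sentence, int start, int end) {
            doc.add(new StoredField("sentences", start + ":" + end + ":" + sentence));
        }
    }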

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Further to this problem, I have created a custom tokenizer but I cannot get it loaded properly by Solr. The error stacktrace: Exception in thread "main" org.apache.solr.common.SolrException: SolrCore 'myproject' is not available due to init failure: Could not load c
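
For reference, Solr instantiates analysis components through factories, so a "Could not load class" init failure often means the schema points at the Tokenizer class itself, or the jar is not on the core's classpath. A minimal factory sketch, assuming Lucene/Solr 5.x and the hypothetical WholeSentenceTokenizer sketched earlier in this thread:

    import java.util.Map;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.util.TokenizerFactory;
    import org.apache.lucene.util.AttributeFactory;

    // Sketch: a TokenizerFactory that Solr can reference from the schema.
    public class WholeSentenceTokenizerFactory extends TokenizerFactory {

        public WholeSentenceTokenizerFactory(Map<String, String> args) {
            super(args);
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }

        @Override
        public Tokenizer create(AttributeFactory factory) {
            // The attribute factory is ignored here for brevity.
            return new WholeSentenceTokenizer();
        }
    }

The fieldType's <tokenizer class="..."/> in schema.xml would then name this factory class, and the jar containing both classes must be visible to the core (for example via a configured lib directory).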

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Steve Rowe
Unless you need to be able search on sentences-as-terms, i.e. exact sentence matching, you should try to find an alternative; otherwise your term index will be unnecessarily huge. Three things come to mind: 1. A single Lucene index can host mixed document types, e.g. full documents and sentenc
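
A hypothetical illustration of the mixed-document-types idea: a "type" field distinguishes full documents from per-sentence documents, and each sentence document carries its parent id and character offsets so it can be related back. All field names here are assumptions made for the sketch, not a prescribed schema:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // Sketch: full documents and sentence documents living in the same index.
    public class MixedDocTypesExample {
        static Document fullDoc(String docId, String body) {
            Document doc = new Document();
            doc.add(new StringField("type", "document", Field.Store.YES));
            doc.add(new StringField("docId", docId, Field.Store.YES));
            doc.add(new TextField("body", body, Field.Store.YES));
            return doc;
        }

        static Document sentenceDoc(String docId, String sentence, int start, int end) {
            Document doc = new Document();
            doc.add(new StringField("type", "sentence", Field.Store.YES));
            doc.add(new StringField("docId", docId, Field.Store.YES));
            doc.add(new StoredField("start", start));
            doc.add(new StoredField("end", end));
            doc.add(new StoredField("text", sentence));
            return doc;
        }
    }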