Re: lucene deliberately removes \r (windows carriage char)

2015-10-03 Thread Ziqi Zhang
Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Ziqi Zhang [mailto:ziqi.zh...@sheffield.ac.uk] Sent: Saturday, October 03, 2015 5:01 PM To: java-user@lucene.apache.org Subject: lucene deliberately removes \r (windows carriage char)

Re: java.io.IOException: Map failed

2015-10-03 Thread Ziqi Zhang
so it's applied on system startup. This depends on your Linux distribution, so we cannot give any help on this. I would also recommend reviewing my blog post, whose URL is given in the exception message! Uwe. On 1 October 2015 21:25:30 MESZ, Ziqi Zhang wrote: Hi, I have a problem which

lucene deliberately removes \r (windows carriage char)

2015-10-03 Thread Ziqi Zhang
Hi, I am trying to pinpoint a mismatch between the offsets produced by the Lucene indexing process and the original document content when I use those offsets to take substrings of it. I have tried to debug as far as I can go, but I lose track of Lucene at line 298 of DefaultIndexingChain (Lucene 5.3.0):
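
One way to chase such a mismatch outside the indexing chain is to run the analyzer directly and compare each token's reported offsets against the raw text, since any \r handling happens during analysis before DefaultIndexingChain sees the tokens. This is a minimal sketch, not the thread's actual setup; StandardAnalyzer, the field name "f", and the sample input are illustrative assumptions:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class OffsetCheck {
        public static void main(String[] args) throws IOException {
            String content = "line one\r\nline two";   // sample input containing \r\n
            Analyzer analyzer = new StandardAnalyzer();
            try (TokenStream ts = analyzer.tokenStream("f", content)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // substring the raw content with the reported offsets;
                    // any disagreement with the token text shows up here
                    String slice = content.substring(off.startOffset(), off.endOffset());
                    System.out.println(term.toString() + " [" + off.startOffset()
                        + "," + off.endOffset() + ") -> \"" + slice + "\"");
                }
                ts.end();
            }
        }
    }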

java.io.IOException: Map failed

2015-10-01 Thread Ziqi Zhang
Hi, I have a problem which I think is the same as the one described here: http://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed However, the solution does not apply in this case, so I am providing more details and asking again. The index was created using Solr 5.3
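
Besides the usual fix of raising the kernel's vm.max_map_count limit, one way to rule out mmap as the culprit is to open the index with NIOFSDirectory instead of the MMapDirectory that FSDirectory.open() selects on 64-bit JVMs. A diagnostic sketch only, with a hypothetical index path:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class OpenWithoutMmap {
        public static void main(String[] args) throws Exception {
            // NIOFSDirectory avoids mmap (and this exception) entirely,
            // at some cost in read performance compared to MMapDirectory.
            Directory dir = new NIOFSDirectory(Paths.get("/path/to/index")); // hypothetical path
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("docs in index: " + reader.maxDoc());
            }
        }
    }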

token n-gram: leading and trailing stopwords removal only

2015-09-25 Thread Ziqi Zhang
Hi, Is there a way to remove just the leading and trailing stopwords from a token n-gram? Currently I have the following combination, which removes any n-gram that contains a stopword: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
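
One way to get this behavior is a custom TokenFilter placed after the shingle filter that drops only those shingles whose first or last word is a stopword. This is an illustrative sketch, not a stock Solr factory; it assumes shingles are space-joined (the ShingleFilter default) and ignores position-increment bookkeeping:

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Illustrative filter: drops shingles whose first or last word is a stopword. */
    public final class EdgeStopwordShingleFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final Set<String> stopwords;

        public EdgeStopwordShingleFilter(TokenStream in, Set<String> stopwords) {
            super(in);
            this.stopwords = stopwords;
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
                // ShingleFilter joins the words of an n-gram with a single space
                String[] words = termAtt.toString().split(" ");
                if (stopwords.contains(words[0])
                        || stopwords.contains(words[words.length - 1])) {
                    continue; // leading or trailing stopword: drop this shingle only
                }
                return true;
            }
            return false;
        }
    }

To use it from a Solr schema rather than a hand-built Analyzer, it would additionally need to be wrapped in a TokenFilterFactory.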

Re: tokenize into sentences/sentence splitter

2015-09-24 Thread Ziqi Zhang
sentences (stored): 33,52|Something happened. Steve www.lucidworks.com On Sep 23, 2015, at 3:26 PM, Ziqi Zhang wrote: Thanks Steve. It probably also makes sense to extract sentences and then store them. But along with each sentence I also need to store its start/end offset.
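
One way to keep a sentence and its offsets together, matching the "start,end|text" encoding shown above, is one stored-only field instance per sentence. A minimal sketch; the field name "sentences" and the helper are illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StoredField;

    public class SentenceFields {
        public static void main(String[] args) {
            Document doc = new Document();
            addSentence(doc, 33, 52, "Something happened."); // offsets from the splitter
            System.out.println(doc.getValues("sentences")[0]); // 33,52|Something happened.
        }

        /** One stored-only (not indexed) field instance per sentence. */
        static void addSentence(Document doc, int start, int end, String sentence) {
            doc.add(new StoredField("sentences", start + "," + end + "|" + sentence));
        }
    }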

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
TestSegmentingTokenizerBase.WholeSentenceTokenizer. However, it sounds like you only need to store the sentences, not search against them, so I don't think you need sentence *tokenization*. Why not simply use the JDK's BreakIterator (or, as you say, OpenNLP) to do sentence splitting and add the sentences to the doc as stored fields? Steve www.lucidworks.com
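
A minimal sketch of that BreakIterator suggestion, printing each sentence with its start/end offsets in the same "start,end|text" shape used elsewhere in the thread (class name, locale, and sample text are illustrative):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class SentenceSplit {
        public static void main(String[] args) {
            String text = "First sentence. Second one! And a third?";
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            // each [start, end) pair is a sentence plus its offsets into the text
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                System.out.println(start + "," + end + "|" + text.substring(start, end).trim());
            }
        }
    }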

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
OpenNLP) to do sentence splitting and add the sentences to the doc as stored fields? Steve www.lucidworks.com On Sep 23, 2015, at 11:39 AM, Ziqi Zhang wrote: Thanks, that is understood. My application is a bit special in that I need both an indexed field with standard tokenization and an unindexed but stored

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
a sentence delimiter, then use that delimiter as the basis for tokenization. More info on sentence recognition: http://opennlp.apache.org/documentation/manual/opennlp.html On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang wrote: Hi, I need a special kind of 'token' which is a sentence, so I need a tokenizer that splits texts into sentences.
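
For the OpenNLP route, SentenceDetectorME's sentPosDetect returns Spans, so start/end offsets come for free without any delimiter insertion. A sketch under the assumption that a pre-trained sentence model sits at the hypothetical path "en-sent.bin":

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.util.Span;

    public class OpenNlpSentences {
        public static void main(String[] args) throws Exception {
            // "en-sent.bin" is a hypothetical path to a pre-trained sentence model
            try (InputStream in = new FileInputStream("en-sent.bin")) {
                SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
                String text = "First sentence. Second one.";
                for (Span s : detector.sentPosDetect(text)) {
                    System.out.println(s.getStart() + "," + s.getEnd() + "|"
                        + text.substring(s.getStart(), s.getEnd()));
                }
            }
        }
    }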

tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Hi, I need a special kind of 'token' which is a sentence, so I need a tokenizer that splits texts into sentences. I wonder if there are already such or similar implementations? If I have to implement it myself, I suppose I need to implement a subclass of Tokenizer. Having looked at a few existing

offsets of a term in a document

2015-09-21 Thread Ziqi Zhang
Hi, Given a document in a Lucene index, I would like to get a list of the terms in that document and their offsets. I suppose starting with IndexReader.getTermVector can get me going with this. I have some code, shown below (Lucene 5.3), about which I have some questions:
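
A sketch of that getTermVector route, walking each term's postings for its offsets. It assumes the field was indexed with term vectors that include positions and offsets (e.g. via FieldType.setStoreTermVectorOffsets); otherwise getTermVector returns null or the offsets are unavailable. Method and class names are illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class TermVectorOffsets {
        /** Prints every term in one document's vector with its offsets. */
        static void dump(IndexReader reader, int docId, String field) throws IOException {
            Terms vector = reader.getTermVector(docId, field);
            if (vector == null) return; // field not indexed with term vectors
            TermsEnum te = vector.iterator();
            BytesRef term;
            while ((term = te.next()) != null) {
                PostingsEnum pe = te.postings(null, PostingsEnum.OFFSETS);
                pe.nextDoc(); // a term vector holds exactly one (pseudo-)document
                for (int i = 0; i < pe.freq(); i++) {
                    pe.nextPosition(); // must advance the position before reading offsets
                    System.out.println(term.utf8ToString()
                        + " [" + pe.startOffset() + "," + pe.endOffset() + ")");
                }
            }
        }
    }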

Re: get frequency of each term from a document

2015-09-20 Thread Ziqi Zhang
You need to enable term vectors during indexing. The pattern for using the terms enum can be looked up in various places in the Lucene source code. It's a very expert API, but it is the way to go here. Uwe. On 20 September 2015 15:35:40 MESZ, Ziqi Zhang wrote: Hi, Is it possible to get a list of terms

get frequency of each term from a document

2015-09-20 Thread Ziqi Zhang
Hi, Is it possible to get a list of terms within a document, and also the TF of each of these terms *in that document only*? (Lucene 5.3) IndexReader has a method "Terms getTermVector(int docID, String field)", which gives me a "Terms" object, on which I can get a TermsEnum. But I do not know where to go from there.
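
Continuing from that TermsEnum: because a term vector spans exactly one document, totalTermFreq() on it is the term's frequency within that document. A minimal sketch (names illustrative; term vectors must have been enabled at index time, per the reply above):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class PerDocTermFreq {
        /** Prints each term and its frequency within this document only. */
        static void dump(IndexReader reader, int docId, String field) throws IOException {
            Terms vector = reader.getTermVector(docId, field); // needs term vectors at index time
            if (vector == null) return;
            TermsEnum te = vector.iterator();
            BytesRef term;
            while ((term = te.next()) != null) {
                // the enum covers a single document, so totalTermFreq() is the in-doc TF
                System.out.println(term.utf8ToString() + " -> " + te.totalTermFreq());
            }
        }
    }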

how to write this SOLR query in LUCENE api?

2015-09-17 Thread Ziqi Zhang
Hi, I am using the TermsComponent in my Solr config like this to deal with queries about terms in the index: <bool name="terms">true</bool> <bool name="distrib">false</bool> <str>terms</str> For example, I want to fetch any *terms* containing "surface defects". Using Solr I can do
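
A rough Lucene-API equivalent of that TermsComponent query is a scan of the field's term dictionary, keeping terms that contain the substring (Solr's terms.regex performs a comparable dictionary scan server-side). A sketch with illustrative names, not the component's actual implementation:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class TermsLookup {
        /** Scans the field's term dictionary and keeps terms containing a substring. */
        static void matching(IndexReader reader, String field, String substring) throws IOException {
            Terms terms = MultiFields.getTerms(reader, field);
            if (terms == null) return; // field does not exist or has no terms
            TermsEnum te = terms.iterator();
            BytesRef term;
            while ((term = te.next()) != null) {
                String s = term.utf8ToString();
                if (s.contains(substring)) {
                    System.out.println(s + " (docFreq=" + te.docFreq() + ")");
                }
            }
        }
    }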