Re: Efficient string lookup using Lucene

2012-08-24 Thread Ahmet Arslan
> search for a string "run", I do not need to find "ran" but I
> do want to find it in all of these strings below:
>
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run

With NGramFilter you can do that. But it creates a lot of tokens. For example "Fox is running fast" becomes F o
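
To make the token blow-up concrete, here is a minimal sketch of character n-gram generation in plain Python. It is not Lucene's NGramFilter: the names `char_ngrams` and `contains_exact` and the size limits of 1 to 3 are assumptions chosen for illustration, and the grams are taken over the whole string rather than per token.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """Return every character n-gram of text with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


def contains_exact(query, text, min_n=1, max_n=3):
    """True if query is one of the n-grams produced for text.

    Only works when min_n <= len(query) <= max_n, which is the same
    constraint an n-gram index places on query length.
    """
    return query in set(char_ngrams(text, min_n, max_n))


if __name__ == "__main__":
    for sample in ["Fox is running fast", "!%#^&$run!$!%@&$#", "run,run"]:
        print(sample, "->", contains_exact("run", sample))
    print("gram count:", len(char_ngrams("Fox is running fast")))
```

The number of grams grows with both the text length and the range of sizes indexed, which is the cost the reply is pointing at.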

Re: Efficient string lookup using Lucene

2012-08-24 Thread Jack Krupansky
I can't speak for any non-Latin languages, but how about simply using the StandardAnalyzer plus the EdgeNGramFilter for indexing (but not at query time)? The latter would allow a query of "run" to match "running".
-- Jack Krupansky
-----Original Message-----
From: Ilya Zavorin
Sent: Friday, August 2
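
As a rough illustration of the edge n-gram idea, the sketch below indexes every prefix of each token and leaves the query untouched. It is plain Python rather than Lucene code; the regex tokenizer, the lowercasing, and the names `edge_ngrams` and `index_terms` are assumptions standing in for an analyzer chain.

```python
import re


def edge_ngrams(token, min_n=1, max_n=10):
    """Return the prefixes of token with lengths min_n..max_n."""
    upper = min(max_n, len(token))
    return [token[:n] for n in range(min_n, upper + 1)]


def index_terms(text):
    """Tokenize on word characters, lowercase, and expand to prefixes."""
    terms = set()
    for token in re.findall(r"\w+", text.lower()):
        terms.update(edge_ngrams(token))
    return terms


if __name__ == "__main__":
    # The query term is looked up as-is; only the indexed side is expanded.
    print("run" in index_terms("Fox is running fast"))
    print("run" in index_terms("overrun"))
```

Because only prefixes are indexed, a query of "run" matches "running" but would not match a token such as "overrun", where the string occurs in the middle or at the end.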

Re: Efficient string lookup using Lucene

2012-08-24 Thread Dawid Weiss
What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/occurrence of any input pattern. Depending on how much text you have on the input it may either be a simple task -- see here: http://labs.carrotsearch.com/jsuffixa
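
For readers unfamiliar with suffix arrays, here is a small Python sketch of the structure. It is not the library linked above: `build_suffix_array` and `find_occurrences` are hypothetical names, and the construction simply sorts all suffixes, which is only practical for small inputs. In this sketch each lookup is a binary search over the sorted suffixes, so its cost depends on the pattern length and the logarithm of the text length.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction for illustration; real libraries use much
    faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of every exact match of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "run,run"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "run"))
    print(find_occurrences(text, sa, "ran"))
```

Matching is on raw characters, so no tokenization or language knowledge is involved, which fits text in unknown or mixed languages.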

Efficient string lookup using Lucene

2012-08-24 Thread Ilya Zavorin
Hi Everyone, I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files. What I need is to implement an efficient EXACT str

RE: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Mike O'Leary
So for Lucene 3.6, is the right way to do this to create a new Document and add new Fields based on the old Fields (with the settings you want them to have for term vector offsets and positions, etc.) and then call updateDocument on that new Document?
Thanks, Mike
-----Original Message-----
Fro

Re: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Robert Muir
Calling IR.document does not restore your 'original Document' completely. This is really an age-old trap. So don't update documents this way: it's fine to fetch their contents, but nothing goes through the effort to ensure that things like term vector parameters are the same as what you originally prov