Re: interpreting scores

2009-05-08 Thread Nate
Wow Karl, thank you so much for writing this up! It was a great help! I have the ngram tokenizing working as you described. Searches are very good! In order to verify the hits are of high quality, I use the Smith-Waterman algorithm. Other approximate string comparisons I evaluated didn't work well
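For readers unfamiliar with the Smith-Waterman check mentioned above, here is a minimal sketch of the local-alignment score it computes. The class name and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration; the thread does not say which parameters or implementation Nate used.

```java
final class SmithWaterman {
    // Illustrative scoring parameters (assumptions, not taken from the thread).
    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -1;

    /** Returns the best local-alignment score between a and b. */
    public static int score(String a, String b) {
        // h[i][j] = best score of an alignment ending at a[i-1] and b[j-1].
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + sub; // align the two characters
                int up = h[i - 1][j] + GAP;       // gap in b
                int left = h[i][j - 1] + GAP;     // gap in a
                // Clamping at zero is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

A hit could then be accepted when `SmithWaterman.score(query, hitText)` exceeds some threshold; what threshold is appropriate depends on the data and is not given in the thread.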

Re: 'problem with indexformat and luke

2009-05-08 Thread Grant Ingersoll
This usually means that your index was created using a newer version of Lucene than is bundled with Luke. You will need to get the Luke minimal jars (no Lucene) and use them along with the Lucene version you have. On May 8, 2009, at 12:42 PM, Timon Roth wrote: hello list i am using luc

Re: 'problem with indexformat and luke

2009-05-08 Thread Matthew Hall
Which version of Luke are you using? Timon Roth wrote: hello list i am using lucene 2.9. when i try to open the index with luke i got an error: unknown format version: -8 any hints? - To unsubscribe, e-mail: java-user-unsubs

'problem with indexformat and luke

2009-05-08 Thread Timon Roth
Hello list, I am using Lucene 2.9. When I try to open the index with Luke I get an error: unknown format version: -8. Any hints?

Re: Re: I got the score "0.3044460713863373" for the cosine similarity of two documents with the same text content !!

2009-05-08 Thread Kamal Najib
Thank you for the reply, I have got it. Kamal. Original Message: What does the searcher.explain() method say? -Grant On May 6, 2009, at 2:18 AM, Kamal Najib wrote: > hi, > thanks for the reply. See: http://lucene.apache.org/java/2_4_1/api/index.html > you will find there the Similarity have cr

RE: Lucene Index Encryption

2009-05-08 Thread Peter_Lenahan
You are correct, other vulnerabilities will of course be the Swap file, which is much easier to dump than the memory contents, since it may persist even when the process dies or the machine is turned off, and of course a process dump or snapshot file. In either case, those cracks would be on a sys

Re: Lucene Index Encryption

2009-05-08 Thread patrick o'leary
There will always be levels at which data is insecurely available, most notably within the memory of an application once it's running, unless you want to go down the path of encrypting and decrypting each and every string. At that point you lose dictionary functionality and, well, any useful e

Re: Lucene Index Encryption

2009-05-08 Thread Karl Wettin
I might be missing something here, but why not just store the index on a cryptographic virtual file system? karl On 8 May 2009, at 19:09, > wrote: Michael, Thanks for the comments; they are very insightful. I hadn't thought about the random access issues until you brought it up. T

Re: Lucene Index Encryption

2009-05-08 Thread Peter_Lenahan
Michael, Thanks for the comments; they are very insightful. I hadn't thought about the random access issues until you brought it up. This makes the project a little tougher, but not impossible. I was searching last night and there have been a couple of papers written on the topic of Encrypt

Re: interpreting scores

2009-05-08 Thread Karl Wettin
On 8 May 2009, at 13:13, Nate wrote: Is it possible to get a count for how many terms a result matched? Currently I think you can only do that by using Searcher.explain(). But that is not a very nice solution. A better solution is being worked on and might be available in a few months or so.

Re: interpreting scores

2009-05-08 Thread Karl Wettin
Ngrams can be used for lots of stuff. In your case it has nothing to do with spellchecking; it was the "until" vs. "'till" that made me think of them, as they would allow you to get at least partial matching of the text. Also, ngrams give you a bit of phrase functionality. Create the grams
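To make the partial-matching idea concrete, here is a minimal sketch that splits a string into character n-grams and counts the grams two strings share. Whether the thread had character grams or word grams in mind is not visible in this excerpt, so character trigrams are an assumption here, as are the class and method names. It uses plain Java rather than any Lucene tokenizer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CharNGrams {
    /** Returns every substring of length n, in order of position. */
    public static List<String> of(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct n-grams that occur in both a and b. */
    public static int shared(String a, String b, int n) {
        Set<String> common = new HashSet<>(of(a, n));
        common.retainAll(of(b, n));
        return common.size();
    }
}
```

With trigrams, "until" and "'till" have the gram "til" in common, which is why a gram-based index can still produce a partial match between the two spellings.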

RE: RegexQuery Incomplete Results

2009-05-08 Thread Steven A Rowe
On 5/8/2009 at 9:13 AM, Ian Lea wrote: > I'm surprised that it matches either - don't you need ".*in" where .* > means match any character zero or more times? See the javadoc for > java.util.regex.Pattern, or for Jakarta Regexp if you are using that > package. > > Unless you're an expert in regex

Re: RegexQuery Incomplete Results

2009-05-08 Thread Ian Lea
I'm surprised that it matches either - don't you need ".*in" where .* means match any character zero or more times? See the javadoc for java.util.regex.Pattern, or for Jakarta Regexp if you are using that package. Unless you're an expert in regexps it is probably worth playing with them outside y
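The difference between the two patterns can be tried outside Lucene with java.util.regex alone. The sketch below is only an illustration of java.util.regex semantics: the helper names are hypothetical, and whether RegexQuery applies the pattern to a whole term or searches within it is not stated in this excerpt.

```java
import java.util.regex.Pattern;

final class RegexDemo {
    /** True only if the entire text matches the regex. */
    public static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    /** True if the regex matches anywhere inside the text. */
    public static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }
}
```

Under a whole-string match, ".in." accepts only four-character strings such as "pint", while ".*in.*" accepts any string containing "in", including "in" itself. Under a substring search, ".in." still requires one character on each side of "in".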

Re: RegexQuery Incomplete Results

2009-05-08 Thread Erick Erickson
I don't understand your regex at all. Isn't it looking for "in" with any *single* character in front and back? Given your example, I don't see how you're getting anything back at all. Is this code you're actually executing or just an example? What do toString() and/or explain() show? Think about getti

Re: Stemming

2009-05-08 Thread Matthew Hall
Ganesh wrote: My opinion is Stemming process is to get the base word. Here it is not doing so. Unfortunately this is where your problem lies: stemming doesn't do this. It breaks words that are almost lexically equivalent down into a similar root word, thus cat = cats. From the wiki: "*Stemm

RegexQuery Incomplete Results

2009-05-08 Thread Huntsman84
Hi, I am using RegexQuery for searching in a set of records which are phrases of several words each. My aim is to find any phrase that contains the given group of letters (e.g. "in"). For that case, I am building the query with the regular expression ".in.", so it should return all phrases with co

Stemming

2009-05-08 Thread Ganesh
Hello all, I am using Lucene 2.4.1 and Snowball Analyzer for my indexing. I am facing some issues with stemming. Raining is stemmed to Rain and cats is stemmed to cat, but Harder is not stemmed to hard and Stronger is not stemmed to Strong. Even the Keyword and Standard analyzers do the same. My opinion is Stemm

Re: interpreting scores

2009-05-08 Thread Nate
Is it possible to get a count for how many terms a result matched? Googling, it doesn't appear to be done easily. I tried it out by breaking my query into words myself, then doing a search for each one and keeping track of the results and counts. This way I know if 4 out of 5 terms matched a docume
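The per-term approach described above can be sketched without Lucene by standing in a small in-memory postings map for the index. Everything here is an assumption made for illustration: the class name, the map-based index, and the whitespace tokenization. A real implementation would run one search per term against the actual index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class TermMatchCounter {
    /**
     * postings maps a term to the ids of the documents containing it
     * (a stand-in for one index search per term).
     * Returns document id -> number of distinct query terms it matched.
     */
    public static Map<Integer, Integer> countMatches(
            Map<String, Set<Integer>> postings, String query) {
        Map<Integer, Integer> counts = new HashMap<>();
        // De-duplicate terms so a repeated word is only counted once.
        Set<String> terms = new LinkedHashSet<>(
                Arrays.asList(query.toLowerCase().trim().split("\\s+")));
        for (String term : terms) {
            Set<Integer> docs =
                    postings.getOrDefault(term, Collections.<Integer>emptySet());
            for (Integer doc : docs) {
                counts.merge(doc, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A document whose count equals the number of distinct query terms matched all of them; lower counts indicate partial matches such as 4 out of 5.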

Re: why setPhraseSlop() not helping

2009-05-08 Thread Seid Mohammed
Thanks Erick, that solves it. On Thu, May 7, 2009 at 8:13 PM, Erick Erickson wrote: > You haven't forced the double quotes through to the parser. Try > Query query = qp.parse("\"word1 word2\""); > > On Thu, May 7, 2009 at 11:14 AM, Seid Mohammed wrote: > >> I have set the slop for my search to be some