RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
>
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65
> (n-.-CR-LF-CR-LF-e) "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.

> I am pretty sure

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
Steve, I am not seeing anything suspicious. Here's what I see in the HEX:
"n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
"e.H" from "sentence.He": 65-2E-0D-0A-48
I am pretty sure I am using the std analyzer. Here's how I add a doc to the index (oc is String containing th

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Ilya, StandardAnalyzer treats all forms of newline as whitespace, and doesn't join tokens across whitespace. Can you look at your original text using a hex editor (or something like it, e.g. Unix "od")? Check which character is actually in between "electricity" and "this", and "pain." and "ele
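
A minimal sketch of that check in plain Java, for cases where a hex editor is not handy. It assumes the document text is already held in a String and is encoded as UTF-8; `HexPeek` and `toHex` are hypothetical names and not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

class HexPeek {
    // Render the UTF-8 bytes of s as two-digit uppercase hex values joined by '-'.
    // (Hypothetical helper for illustration; not a Lucene API.)
    static String toHex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A period followed by two DOS/Windows line endings (CR LF CR LF), then a letter.
        System.out.println(toHex("n.\r\n\r\ne"));
        // A period followed by a single CR LF, then a letter.
        System.out.println(toHex("e.\r\nH"));
    }
}
```

In practice one would pass a short substring around the suspicious spot, so that any unexpected byte between the two words stands out.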

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
Steve, Thanks much for the link: very useful! I looked at the index and found that it contains terms like
electricitythis -- from Doc 3
pain.electricity -- from Doc 1
sentence.he -- from Doc 1
It appears that there is some sort of issue with handling end-of-lines. What do I need to change at

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya, What analyzers are you using at index-time and query-time? My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match. Luke can
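
A minimal sketch of why punctuation left inside tokens defeats a plain term lookup. It uses ordinary Java string splitting as a stand-in for two kinds of analysis and does not reproduce how any Lucene analyzer is implemented; `TokenDemo`, `whitespaceTokens` and `letterTokens` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenDemo {
    // Stand-in for analysis that splits on whitespace only,
    // so trailing punctuation stays attached to the word.
    static List<String> whitespaceTokens(String text) {
        return split(text, "\\s+");
    }

    // Stand-in for analysis that drops punctuation:
    // split on runs of characters that are not letters.
    static List<String> letterTokens(String text) {
        return split(text, "[^\\p{L}]+");
    }

    // Lowercase, split on the given regex, and discard empty pieces.
    private static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.toLowerCase(Locale.ROOT).split(regex)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "A sentence. Another sentence?";
        // Tokens keep their punctuation, so the bare word is not among them.
        System.out.println(whitespaceTokens(doc).contains("sentence"));
        // Punctuation is stripped, so the bare word is found.
        System.out.println(letterTokens(doc).contains("sentence"));
    }
}
```

The same reasoning applies to an index: a query term only matches if the exact same token was emitted at index time, which is why looking at the indexed terms directly is a useful first step.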

can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
I am writing a Lucene-based indexing-search app and testing it using some simple docs and queries. I have 3 simple docs that are shown at the bottom of this email between pairs of "==="s and about a dozen terms. One of them is "electricity". As you can see, it appears in al