RE: can't find common words -- using Lucene 3.4.0

2012-03-28 Thread Ilya Zavorin
)); IndexWriter writer = new IndexWriter(dir, iwc); Anything suspicious here? Thanks Ilya Zavorin -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Monday, March 26, 2012 1:48 PM To: java-user@lucene.apache.org Subject: RE: can't find common

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
>
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65
> (n-.-CR-LF-CR-LF-e) "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.

> I am pretty sure

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
anks, Ilya -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Monday, March 26, 2012 11:41 AM To: java-user@lucene.apache.org Subject: RE: can't find common words -- using Lucene 3.4.0 Ilya, StandardAnalyzer treats all forms of newline as whitespace,

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
orin [mailto:izavo...@caci.com] Sent: Monday, March 26, 2012 11:21 AM To: java-user@lucene.apache.org Subject: RE: can't find common words -- using Lucene 3.4.0 Steve, Thanks much for the link: very useful! I looked at the index and found that it contains terms like electricitythis -- from D

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
analyzers for respective foreign texts Thanks, Ilya -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Monday, March 26, 2012 10:59 AM To: java-user@lucene.apache.org Subject: RE: can't find common words -- using Lucene 3.4.0 Hi Ilya, What analyzers are you

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya, What analyzers are you using at index-time and query-time? My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match. Luke can
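
To illustrate the guess above, here is a minimal sketch in plain Java with no Lucene dependency. The class `TokenDemo` and both methods are hypothetical stand-ins, not Lucene APIs: one imitates an analyzer that splits on whitespace only and so leaves punctuation attached to tokens, the other imitates one that splits on any non-letter, non-digit character. It only shows why an exact-term lookup for "sentence" can miss when the indexed terms are "sentence." and "sentence?"; it does not reproduce what any particular Lucene analyzer actually emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenDemo {

    // Hypothetical stand-in for an analyzer that splits on whitespace only,
    // so trailing punctuation stays attached to each token.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for an analyzer that splits on every character
    // that is neither a letter nor a digit, so punctuation is dropped.
    public static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This is a sentence. Is this a sentence?";

        // Terms that would end up in the "index" under each scheme.
        System.out.println(whitespaceTokens(text));
        System.out.println(wordTokens(text));

        // An exact-term lookup for "sentence" under each scheme.
        System.out.println(whitespaceTokens(text).contains("sentence"));
        System.out.println(wordTokens(text).contains("sentence"));
    }
}
```

With the whitespace-only scheme the lookup fails because the stored terms carry the punctuation; with the second scheme it succeeds. Inspecting the real index terms is still the reliable way to confirm which case applies.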