On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
>
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
> "e.H" from "sentence.He": 65-2E-0D-0A-48
I agree, standard DOS/Windows line endings.
> I am pretty sure
Steve,
I am not seeing anything suspicious. Here's what I see in the HEX:
"n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
"e.H" from "sentence.He": 65-2E-0D-0A-48
I am pretty sure I am using the std analyzer
Here's how I add a doc to the index (oc is String containing th
Ilya,
StandardAnalyzer treats all forms of newline as whitespace, and doesn't join
tokens across whitespace. Can you look at your original text using a hex
editor (or something like it, e.g. Unix "od")? Check which character is
actually in between "electricity" and "this", and "pain." and "ele
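
If no hex editor is at hand, the same check can be sketched in a few lines of Java. This is only an illustration: the class name HexContext, its two helper methods, and the sample string are hypothetical, and the sample merely mimics the CR-LF-CR-LF sequence reported elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;

public class HexContext {
    // Render the UTF-8 bytes of s as two-digit uppercase hex values joined by '-'.
    public static String toHex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append('-');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Hex-dump the span from the last char of `before` through the first char
    // of `after`, so whatever sits between the two words becomes visible.
    // Returns null if either marker is not found.
    public static String between(String text, String before, String after) {
        int start = text.indexOf(before);
        if (start < 0) return null;
        int end = text.indexOf(after, start + before.length());
        if (end < 0) return null;
        return toHex(text.substring(start + before.length() - 1, end + 1));
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not the actual document from the thread.
        String sample = "pain.\r\n\r\nelectricity";
        System.out.println(between(sample, "pain.", "electricity"));
    }
}
```

In the output, 0D is CR and 0A is LF; any other byte between the two words would point to a non-standard separator in the original file.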
Steve,
Thanks much for the link: very useful!
I looked at the index and found that it contains terms like
electricitythis -- from Doc 3
pain.electricity -- from Doc 1
sentence.he -- from Doc 1
It appears that there is some sort of issue with handling end-of-lines. What do
I need to change at
Hi Ilya,
What analyzers are you using at index-time and query-time?
My guess is that you're using an analyzer that includes punctuation in the
tokens it emits, in which case your index will have things like "sentence." and
"sentence?" in it, so querying for "sentence" will not match.
Luke can
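
The guess above can be illustrated without Lucene at all. The sketch below is not Lucene code: String.split on whitespace stands in for a hypothetical analyzer that keeps punctuation attached to tokens, and the class name, method name, and sample text are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PunctuationTokens {
    // Stand-in for an analyzer that lowercases and splits only on whitespace,
    // leaving trailing punctuation attached to each token.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Hypothetical sample text with a DOS/Windows line ending.
        List<String> terms = whitespaceTokens("This is a sentence.\r\nHe said so.");
        System.out.println(terms);
        // An exact-term lookup for "sentence" fails because the stored term is "sentence."
        System.out.println(terms.contains("sentence"));
        System.out.println(terms.contains("sentence."));
    }
}
```

With terms stored this way, a query for the bare word finds nothing, which is the symptom described above. Whether this is what happens in the actual index depends on the analyzer really in use.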
I am writing a Lucene-based indexing-search app and testing it using some
simple docs and queries. I have 3 simple docs that are shown at the bottom of
this email between pairs of "==="s and about a dozen terms.
One of them is "electricity". As you can see, it appears in al